CONSISTENCY OF BAYES ESTIMATORS OF A BINARY
REGRESSION FUNCTION

By Marc Coram and Steven P. Lalley 

University of Chicago 

When do nonparametric Bayesian procedures "overfit"? To shed
light on this question, we consider a binary regression problem in
detail and establish frequentist consistency for a certain class of Bayes
procedures based on hierarchical priors, called uniform mixture
priors. These are defined as follows: let $\nu$ be any probability distribution
on the nonnegative integers. To sample a function $f$ from the prior
$\pi^\nu$, first sample $m$ from $\nu$ and then sample $f$ uniformly from the set
of step functions from [0, 1] into [0, 1] that have exactly $m$ jumps (i.e.,
sample all $m$ jump locations and $m + 1$ function values independently
and uniformly). The main result states that if a data-stream is
generated according to any fixed, measurable binary-regression function
$f_0 \not\equiv 1/2$, then frequentist consistency obtains: that is, for any $\nu$ with
infinite support, the posterior of $\pi^\nu$ concentrates on any $L^1$
neighborhood of $f_0$. Solution of an associated large-deviations problem is
central to the consistency proof.

1. Introduction. 

1.1. Consistency of Bayes procedures. It has been known since the work 
of Freedman [7] that Bayesian procedures may fail to be consistent in the 
frequentist sense: For estimating a probability density on the natural num- 
bers, Freedman exhibited a prior that assigns positive mass to every open 
set of possible densities, but for which the posterior is consistent only at 
a set of the first category. Freedman's example is neither pathological nor 
rare: for other instances, see [4, 8, 10] and the references therein. 

Frequentist consistency of a Bayes procedure here will mean that the
posterior probability of each neighborhood of the true parameter tends to 1.
The choice of topology may be critical: For consistency in the weak
topology on measures, it is generally enough that the prior should place positive
mass on every Kullback-Leibler neighborhood of the true parameter [22], 
but for consistency in stronger topologies, more stringent requirements on 
the prior are needed — see, for example, [1, 9, 23]. Roughly, these demand not 
only that the prior charge Kullback-Leibler neighborhoods of the true pa- 
rameter, but also that it not be overly diffuse, as this can lead to overfitting. 
Unfortunately, it appears that in certain nonparametric function estimation 
problems, the general formulation of this latter requirement for consistency 
in [1] is far too stringent (see the discussion in Section 1.3 below), as it 
rules out large classes of useful priors for which the corresponding Bayes 
procedures are in fact consistent. 

1.2. Binary regression. The purpose of this paper is to examine in detail 
the consistency properties of Bayes procedures based on certain hierarchical 
priors in a nonparametric regression problem. For mathematical simplicity, 
we shall work in the setting of binary regression, with covariates valued in the 
unit interval [0, 1], and we shall limit consideration to the uniform mixture 
priors defined below. The approach we develop can, however, be adapted 
to a variety of function estimation problems in one dimension (and perhaps 
in higher dimensions as well) and to other classes of hierarchical priors. In 
Section 1.4 below we provide a brief template of the approach. 

Consistency of Bayes procedures in binary regression has been studied 
previously by Diaconis and Freedman [5, 6] for a class of priors — suggested 
by de Finetti — that are supported by the set of step functions with discon- 
tinuities at dyadic rationals. The use of such priors may be quite reasonable 
in circumstances where the covariate is actually an encoding (via binary 
expansion) of an infinite sequence of binary covariates. However, in applica- 
tions where the numerical value of the covariate represents a real physical 
variable, the restriction to step functions with discontinuities only at dyadics 
is highly unnatural; and simulations show that when the regression function 
is continuous, the concentration of the posterior may be quite slow. 

Coram [3] proposed another class of priors, which we shall call uniform 
mixture priors, on step functions. These are at once mathematically natu- 
ral, allow computationally efficient simulation of posteriors, and appear to 
have much more favorable concentration properties for data generated by 
continuous binary regression functions than do the Diaconis-Freedman pri- 
ors. In simulation experiments [3] the posterior mean of a uniform mixture 
prior had noticeably smaller MSE on average than CART estimates. Its per- 
formance was similar to bagged CART, but with slightly smaller MSE on 
average. See Figure 1 for an example. 

Fig. 1. Simulation example: The data is simulated with 1024 random x-values and
Bernoulli y's whose success probability is the true binary regression function, the thick
gray curve. The thin solid curve is the posterior mean of the uniform mixture prior with $\nu$
chosen to be a Geometric distribution. For comparison, the dotted line is cross-validated
CART and the dash-dotted line is bagged CART. The white and gray histograms at the
bottom show the raw data.

The uniform mixture priors $\pi^\nu$, like those of Diaconis and Freedman,
are hierarchical priors parametrized by probability distributions $\nu$ on the
nonnegative integers. A random step function with distribution $\pi^\nu$ can be
obtained as follows: (1) Choose a random integer $M$ with distribution $\nu$.
(2) Given that $M = m$, choose $m$ points $U_i$ at random in [0, 1] according to
the uniform distribution: these are the points of discontinuity of the step
function. (3) Given $M = m$ and the discontinuities $U_i$, choose the $m + 1$ step
heights $W_i$ by sampling again from the uniform distribution. The uniform
sampling in steps (2)-(3) allows for easy and highly efficient
Metropolis-Hastings simulations of posteriors; the uniform distribution could be
replaced by other distributions in either step, at the expense of some efficiency
in posterior simulations (and our main theoretical results could easily be
extended to such priors), but we see no compelling reason to discuss such
generalizations in detail.
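
The three sampling steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the hierarchy prior $\nu$ is taken to be Geometric on {0, 1, 2, ...} with an arbitrarily chosen parameter, one admissible choice with infinite support.

```python
import bisect
import random


def sample_step_function(rng, geometric_p=0.5):
    """Draw (split_points, heights) following steps (1)-(3)."""
    # Step (1): M ~ nu. Here nu is Geometric on {0, 1, 2, ...} (an assumption).
    m = 0
    while rng.random() > geometric_p:
        m += 1
    # Step (2): m split points, i.i.d. Uniform[0, 1]; sorted for lookup only.
    split_points = sorted(rng.random() for _ in range(m))
    # Step (3): m + 1 step heights, i.i.d. Uniform[0, 1].
    heights = [rng.random() for _ in range(m + 1)]
    return split_points, heights


def evaluate(split_points, heights, x):
    """Value of the step function at x: the height of the cell containing x."""
    return heights[bisect.bisect_right(split_points, x)]


rng = random.Random(0)
u, w = sample_step_function(rng)
print(len(u), [round(evaluate(u, w, x), 3) for x in (0.1, 0.5, 0.9)])
```

Sorting the split points does not change the distribution of the resulting step function, because the heights are exchangeable; it only makes evaluation by binary search possible.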

Let $f$ be a binary regression function on [0, 1], that is, a Borel-measurable
function $f : [0, 1] \to [0, 1]$. We shall assume that under $P_f$ the data $(X_i, Y_i)$ are
i.i.d. random vectors, with $X_i$ uniformly distributed on [0, 1] and $Y_i$, given
$X_i = x$, Bernoulli-$f(x)$. (Our main result would also hold if the covariate
distribution were not uniform but any other distribution giving positive mass
to all intervals of positive length.) Let $Q^\nu = \int P_f \, d\pi^\nu$, and denote by
$Q^\nu(\cdot \mid \mathcal{F}_n)$ the posterior distribution on step functions given the first $n$
observations $(X_i, Y_i)$ [more precisely, the regular version of the conditional
distribution defined by (17) below]. The main result of the paper is as follows.

Theorem 1. Assume that the hierarchy prior $\nu$ is not supported by a
finite subset of the integers. Then for every binary regression function
$f \not\equiv 1/2$, the $Q^\nu$-Bayes procedure is $L^1$-consistent at $f$, that is, for every
$\varepsilon > 0$,

(1)  $\lim_{n \to \infty} P_f\{ Q^\nu(\{ g : \|g - f\|_1 > \varepsilon \} \mid \mathcal{F}_n) > \varepsilon \} = 0.$

The restriction $f \not\equiv 1/2$ arises for precisely the same reason as in [6],
namely, that this exceptional function is the prior mean of the regression
function. See [6] for further discussion.

Theorem 1 implies that the uniform mixture priors enjoy the same con- 
sistency as do the Diaconis-Freedman priors [6]. This is not exactly unex- 
pected, but neither should it be considered a priori obvious — as the proof 
will show, there are substantial differences between the uniform mixture 
priors and those of Diaconis and Freedman: In particular, since the uniform 
mixture priors allow the step-function discontinuities to arrange themselves 
in favorable (but atypical for uniform samples) configurations vis-a-vis the 
data, the danger of overfitting would seem, at least a priori, greater than 
for the Diaconis-Freedman priors. The bulk of the proof (Sections 4-5) will 
be devoted to showing that such overfitting does not occur, except possibly 
when $f \equiv 1/2$.

Theorem 1 asserts only weak convergence (i.e., in $P_f$-probability). In fact,
the arguments can be extended to establish almost sure convergence, but at 
the expense of added complication. This would involve replacing the subad- 
ditive WLLN in Appendix A by a corresponding strong law, and modifying 
those arguments in Section 4 used to verify the hypotheses of the subadditive 
WLLN. 

1.3. Relation to other work. There is a substantial literature on the con- 
sistency of Bayes procedures, much of it devoted to establishing sufficient 
conditions. See, for a start, [1, 10, 22, 25, 26] and the references therein. 
Certain of the sufficient conditions developed in these papers apply, at least 
in principle, to hierarchical priors of the type considered here. Unfortu- 
nately, these conditions require that the prior be highly concentrated on 
low-complexity models. For instance, the main result of [1] would require
for the uniform mixture priors that the hierarchy prior $\nu$ satisfy

$\sum_{k \ge m} \nu_k \le e^{-mC}$

for some $C > 0$ (see [3]). Recent results of Walker [25] improve those of [1],
but for uniform mixture priors evidently still require that $\nu$ have an
exponentially decaying tail ([25], Section 6.3). Such restrictions on the tail of
the hierarchy prior certainly prevent the accumulation of posterior mass on 
models that are overfit, but at the possible cost of having the posterior favor 
models that are underfit. Preliminary analysis seems to indicate that, when 
the true regression function is smooth, more rapid posterior concentration 
takes place when the hierarchy prior has a rather long tail. One objective 
of this paper is to show that, in at least one model of real statistical inter- 
est, the problem of finding the right conditions for frequentist consistency 
of Bayesian procedures requires a careful analysis of an associated large 
deviations problem. 

1.4. Overfitting and large deviations problems. The possibility of over- 
fitting is tied up with a certain large deviations problem connected to the 
model: this is the most interesting mathematical feature of the paper. (A 
similar large deviations problem occurs in [6], but there it reduces easily 
to the classical Cramer LD theorem for sums of i.i.d. random variables.) 
Roughly, we will show in Section 4 that, as the sample size $n \to \infty$, the
posterior probability of the set of step functions with more than $\alpha n$
discontinuities decays like $e^{n\psi(\alpha)}$, where $\psi(\alpha) < 0$. Then, in Section 5, we will
show that $\psi(\alpha)$ is uniquely maximized at $\alpha = 0$; this will imply that, for
large sample size $n$, most of the posterior mass is concentrated on step
functions with a small number of discontinuities relative to $n$. Concentration of
the posterior in $L^1$-neighborhoods of the true regression function will then
follow by routine arguments; see Section 3.

We expect (and hope to show in a subsequent paper) that in a variety of 
problems, for certain classes of hierarchical priors, the critical determinant 
of the consistency of Bayes procedures will prove to be the rate functions 
in associated large deviations problems. The template of the analysis is as
follows: Let

(2)  $\pi = \pi^\nu = \sum_{m=0}^{\infty} \nu_m \pi_m$

be a hierarchical prior obtained by mixing priors $\pi_m$ of "complexity" $m$. Let
$Q$ and $Q_m$ be the probability distributions on the space of data sequences
gotten by mixing with respect to $\pi$ and $\pi_m$, respectively; and let $Q(\cdot \mid \mathcal{F}_n)$ and
$Q_m(\cdot \mid \mathcal{F}_n)$ be the corresponding posterior distributions given the information
in the $\sigma$-field $\mathcal{F}_n$. Then by Bayes' formula,

(3)  $Q(\cdot \mid \mathcal{F}_n) = \frac{\sum_{m=0}^{\infty} \nu_m Z_{m,n} Q_m(\cdot \mid \mathcal{F}_n)}{\sum_{m=0}^{\infty} \nu_m Z_{m,n}},$

where $Z_{m,n}$ are the predictive probabilities for the data in $\mathcal{F}_n$ based on the
model $Q_m$ (see Section 2.2 for more detail in the binary regression
problem). This formula makes apparent that the relative sizes of the predictive
probabilities $Z_{m,n}$ determine where the mass in the posterior $Q^\nu(\cdot \mid \mathcal{F}_n)$ is
concentrated. The large deviations problem is to show that as $m, n \to \infty$ in
such a way that $m/n \to \alpha$,

(4)  $n^{-1} \log Z_{m,n} \to \psi(\alpha)$

for a suitable rate function $\psi$, and then to
show that $\psi(\alpha)$ is uniquely maximized at $\alpha = 0$. This, when true, will imply
that most of the posterior mass will be concentrated on models with small
complexity $m$ relative to the sample size $n$, where overfitting does not occur.

1.5. Choice of topology. The use of the $L^1$-metric (equivalently, any
$L^p$-metric, $0 < p < \infty$) in measuring posterior concentration, as in (1),
although in many ways natural, may not always be appropriate. Posterior
concentration relative to the $L^1$-metric justifies confidence that, for a new
random sample of individuals with covariates uniformly distributed on [0, 1],
the responses will be reasonably well-predicted by regression function
samples from the posterior, but it would not justify similar confidence for a
random sample of individuals all with covariate (say) $x = 0.47$. For this,
posterior concentration in the sup-norm metric would be required. We do
not yet know if consistency holds in the sup-norm metric, for either the
uniform mixture priors or the Diaconis-Freedman priors, even for smooth $f$;
but we conjecture that it does.

2. Preliminaries. 

2.1. Data. A (binary) regression function is a Borel measurable function
$f : J \to [0, 1]$, where $J$ is an interval. Most often the interval $J$ will be the
unit interval. For each binary regression function $f$, let $P_f$ be a probability
measure on a measurable space supporting a data stream $\{(X_n, Y_n)\}_{n \ge 1}$ such
that under $P_f$ the random variables $X_n$ are i.i.d. Uniform-[0, 1] and,
conditional on $\sigma(\{X_n\}_{n \ge 1})$, the random variables $Y_n$ are independent Bernoullis
with conditional means

(5)  $E(Y_n \mid \sigma(\{X_m\}_{m \ge 1})) = f(X_n).$

(In several arguments below it will be necessary to consider alternative
distributions $F$ for the covariates $X_n$. In such cases we shall adopt the convention
of adding the subscript $F$ to relevant quantities; thus, for instance, $P_{f,F}$ will
denote a probability distribution under which the covariates $X_n$ are i.i.d.
$F$, and the conditional distribution of the responses $Y_n$ is the same as
under $P_f$.) We shall assume when necessary that probability spaces support
additional independent streams of uniform and exponential r.v.s (and thus
also Poisson processes), so that auxiliary randomization is possible. Generic
data sets [values of the first $n$ pairs $(x_i, y_i)$] will be denoted $(\mathbf{x}, \mathbf{y})$ or $(\mathbf{x}, \mathbf{y})_n$
to emphasize the sample size; the corresponding random vectors will be
denoted by the matching upper case letters $(\mathbf{X}, \mathbf{Y})$. For any data set $(\mathbf{x}, \mathbf{y})$
and any interval $J \subset [0, 1]$, the number of successes ($y_i = 1$), failures ($y_i = 0$)
and the total number of data points with covariate $x_i \in J$ will be denoted
by

$N^S(J)$, $N^F(J)$ and $N(J) = N^S(J) + N^F(J)$.

In certain comparison arguments, it will be convenient to have data
streams for different regression functions defined on a common probability
space $(\Omega, \mathcal{F}, P)$. This may be accomplished by the usual device: Let $\{X_n\}_{n \ge 1}$
and $\{V_n\}_{n \ge 1}$ be independent, identically distributed Uniform-[0, 1] random
variables, and set

(6)  $Y_n^f = \mathbf{1}\{V_n < f(X_n)\}.$

2.2. Priors on regression functions. The prior distributions on regression
functions considered in this paper are probability measures on the set of step
functions with finitely many discontinuities. Points of discontinuity, or split
points, of step functions will be denoted by $u_i$, and step heights by $w_i$.
Each vector $\mathbf{u} = (u_1, u_2, \ldots, u_m)$ of split points induces a partition of the
unit interval into $m + 1$ subintervals (or cells) $J_i = J_i(\mathbf{u})$. [Note: We do not
assume that split point vectors $(u_1, u_2, \ldots, u_m)$ are ordered.] Denote by $\pi_{\mathbf{u}}$
the probability measure on step functions with discontinuities $u_i$ that makes
the step height random variables $W_i$ [i.e., the values $W_i$ on the intervals $J_i(\mathbf{u})$]
independent and uniformly distributed on [0, 1]. For each nonnegative integer
$m$, define $\pi_m$ to be the uniform mixture of the measures $\pi_{\mathbf{u}}$ over all split
point vectors $\mathbf{u}$ of length $m$, that is,

(7)  $\pi_m = \int_{\mathbf{u} \in [0,1]^m} \pi_{\mathbf{u}} \, d(\mathbf{u}).$

It is, of course, possible to mix against distributions $G$ other than the
uniform, and in some arguments it will be necessary for us to do so: in such
cases (see, e.g., Section 2.3) we shall use additional subscripts $G$ on
various objects to indicate that the split point vectors are gotten by sampling
from $G$ instead of from the uniform distribution $U$. The priors of primary
interest, those considered in Theorem 1 and equation (2), are mixtures
$\pi^\nu = \sum_m \nu_m \pi_m$ of the measures $\pi_m$ against hierarchy priors $\nu$ on the
nonnegative integers $\mathbb{N}$. Unless otherwise stated, assume throughout that the
hierarchy prior $\nu$ is not supported by a finite subset of $\mathbb{N}$.

Each of the probability measures $\pi_{\mathbf{u}}$, $\pi_m$ and $\pi^\nu$ induces a corresponding
probability measure on the space of data sequences by mixing:

(8)  $Q_{\mathbf{u}} = \int P_f \, d\pi_{\mathbf{u}}(f),$

(9)  $Q_m = \int P_f \, d\pi_m(f)$

and

(10)  $Q^\nu = \int P_f \, d\pi^\nu(f).$

Observe that $Q_m$ is the uniform mixture of the measures $Q_{\mathbf{u}}$ over split point
vectors $\mathbf{u}$ of size $m$, and $Q^\nu$ is the $\nu$-mixture of the measures $Q_m$.

For any data sample $(\mathbf{x}, \mathbf{y})$, the posterior distribution $Q(\cdot \mid (\mathbf{x}, \mathbf{y}))$ under
any of the measures $Q_{\mathbf{u}}$, $Q_m$ or $Q^\nu$ is the conditional distribution on the
set of step functions given that $(\mathbf{X}, \mathbf{Y}) = (\mathbf{x}, \mathbf{y})$. The posterior distribution
$Q_{\mathbf{u}}(\cdot \mid (\mathbf{x}, \mathbf{y}))$ can be explicitly calculated: it is the distribution that makes the
step height r.v.s $W_i$ independent, with Beta-$(N_i^S, N_i^F)$ distributions, where
$N_i^S = N^S(J_i(\mathbf{u}))$ and $N_i^F = N^F(J_i(\mathbf{u}))$ are the success/failure counts in the
intervals $J_i$ of the partition induced by $\mathbf{u}$. Thus, the joint density of the step
heights (relative to product Lebesgue measure on the cube $[0, 1]^{m+1}$) is

(11)  $q_{\mathbf{u}}(\mathbf{w} \mid (\mathbf{x}, \mathbf{y})) = Z_{\mathbf{u}}(\mathbf{x}, \mathbf{y})^{-1} \prod_{i=0}^{m} w_i^{N_i^S} (1 - w_i)^{N_i^F},$

where the normalizing constant $Z_{\mathbf{u}}(\mathbf{x}, \mathbf{y})$, henceforth called the $Q_{\mathbf{u}}$-predictive
probability for the data sample $(\mathbf{x}, \mathbf{y})$, is given by

(12)  $Z_{\mathbf{u}}(\mathbf{x}, \mathbf{y}) = \int_{[0,1]^{m+1}} \prod_{i=0}^{m} w_i^{N_i^S} (1 - w_i)^{N_i^F} \, d\mathbf{w} = \prod_{i=0}^{m} B(N_i^S, N_i^F)$

and

(13)  $B(m, n) = \left( (m + n + 1) \binom{m + n}{m} \right)^{-1}.$

(This is not the usual convention for the arguments of the beta function, but
will save us from a needless proliferation of +1s.) The posterior distributions
$Q_m(\cdot \mid (\mathbf{x}, \mathbf{y}))$ and corresponding predictive probabilities $Z_m(\mathbf{x}, \mathbf{y})$ are related
to $Q_{\mathbf{u}}(\cdot \mid (\mathbf{x}, \mathbf{y}))$ and $Z_{\mathbf{u}}(\mathbf{x}, \mathbf{y})$ as follows:

(14)  $Q_m(\cdot \mid (\mathbf{x}, \mathbf{y})) = \left\{ \int_{\mathbf{u} \in [0,1]^m} Q_{\mathbf{u}}(\cdot \mid (\mathbf{x}, \mathbf{y})) \, Z_{\mathbf{u}}(\mathbf{x}, \mathbf{y}) \, d(\mathbf{u}) \right\} \Big/ Z_m(\mathbf{x}, \mathbf{y}),$

where

(15)  $Z_m(\mathbf{x}, \mathbf{y}) = \int_{\mathbf{u} \in (0,1)^m} Z_{\mathbf{u}}(\mathbf{x}, \mathbf{y}) \, d(\mathbf{u}) = \int_{\mathbf{u} \in (0,1)^m} \prod_{i=0}^{m} B(N_i^S, N_i^F) \, d(\mathbf{u}).$

(Note: The dependence of the integrand on $\mathbf{u}$, via the values of the
success/failure counts $N_i^S$, $N_i^F$, is suppressed.) In general, the last integral
cannot be evaluated in closed form, unlike the integral (12) that defines the
$Q_{\mathbf{u}}$-predictive probabilities. This, as we shall see in Sections 4-5, will make
the mathematical analysis of the posteriors considerably more difficult than
the corresponding analysis for Diaconis-Freedman priors.
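
As an illustration of (12), (13) and (15), the following Python sketch computes $\log Z_{\mathbf{u}}$ exactly from the cell counts and approximates $\log Z_m$ by averaging $Z_{\mathbf{u}}$ over uniformly drawn split-point vectors. The function names and the toy data are hypothetical, and plain Monte Carlo is used here only because it is the most direct reading of (15); it is not the Metropolis-Hastings scheme mentioned in Section 1.2.

```python
import bisect
import math
import random


def log_B(m, n):
    """log B(m, n) under convention (13): B = 1 / ((m + n + 1) * C(m + n, m))."""
    return -math.log(m + n + 1) - math.log(math.comb(m + n, m))


def log_Z_u(u, xs, ys):
    """log Z_u(x, y) from (12): sum of log B(N_i^S, N_i^F) over the cells of u."""
    cuts = sorted(u)  # u need not be ordered; sorting only helps the lookup
    succ = [0] * (len(cuts) + 1)
    fail = [0] * (len(cuts) + 1)
    for x, y in zip(xs, ys):
        cell = bisect.bisect_left(cuts, x)  # number of split points below x
        if y == 1:
            succ[cell] += 1
        else:
            fail[cell] += 1
    return sum(log_B(s, f) for s, f in zip(succ, fail))


def log_Z_m_monte_carlo(m, xs, ys, draws, rng):
    """Plain Monte Carlo estimate of log Z_m in (15): average Z_u over uniform u."""
    logs = [log_Z_u([rng.random() for _ in range(m)], xs, ys) for _ in range(draws)]
    top = max(logs)  # log-sum-exp shift to avoid underflow
    return top + math.log(sum(math.exp(v - top) for v in logs) / draws)


xs = [0.1, 0.2, 0.7, 0.9]  # toy covariates (hypothetical)
ys = [1, 1, 0, 0]          # toy responses (hypothetical)
print(math.exp(log_Z_u([0.5], xs, ys)))
print(math.exp(log_Z_m_monte_carlo(1, xs, ys, 2000, random.Random(0))))
```

Working on the log scale matters in practice, since by (22) below the predictive probabilities can be as small as $4^{-n}$.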

Note for future reference (Section 5) that the predictive probabilities $Z_m$
are related to likelihood ratios $dQ_m/dP_f$. In particular, when $f \equiv p$ is
constant,

(16)  $Z_m((\mathbf{X}, \mathbf{Y})_n) = p^{N^S} (1 - p)^{N^F} \left( \frac{dQ_m}{dP_p} \right)_{\mathcal{F}_n},$

where $\mathcal{F}_n$ is the $\sigma$-algebra generated by the first $n$ data points, and $N^S$, $N^F$
are the numbers of successes and failures in the entire data set $(\mathbf{X}, \mathbf{Y})_n$.
Finally, observe that the posterior distribution $Q^\nu(\cdot \mid (\mathbf{x}, \mathbf{y}))$ is related to the
posterior distributions $Q_m(\cdot \mid (\mathbf{x}, \mathbf{y}))$ by Bayes' formula,

(17)  $Q^\nu(\cdot \mid (\mathbf{x}, \mathbf{y})) = \left\{ \sum_{m=0}^{\infty} \nu_m Z_m(\mathbf{x}, \mathbf{y}) \, Q_m(\cdot \mid (\mathbf{x}, \mathbf{y})) \right\} \Big/ \left\{ \sum_{m=0}^{\infty} \nu_m Z_m(\mathbf{x}, \mathbf{y}) \right\}.$

The goal of Sections 4-5 will be to show that, for large samples $(\mathbf{X}, \mathbf{Y})_n$,
under $P_f$ the predictive probabilities $Z_{\alpha n}((\mathbf{X}, \mathbf{Y})_n)$ are of smaller
exponential magnitude for $\alpha > 0$ than for $\alpha = 0$. This will imply that the posterior
concentrates in the region $m \ll n$, where the number of split points is small
compared to the number of data points.
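
Formula (17) says that the posterior is a mixture of the $Q_m$-posteriors with weights proportional to $\nu_m Z_m(\mathbf{x}, \mathbf{y})$. A minimal sketch of that weight computation follows; it assumes the sum over $m$ has been truncated to finitely many terms and that the values $\log \nu_m$ and $\log Z_m$ are supplied by the caller. The function name is hypothetical.

```python
import math


def posterior_over_m(log_prior, log_Z):
    """Mixture weights in (17): proportional to nu_m * Z_m, for m = 0, 1, ..."""
    scores = [lp + lz for lp, lz in zip(log_prior, log_Z)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # log-sum-exp stabilization
    total = sum(weights)
    return [w / total for w in weights]


# Two candidate complexities with equal prior mass and Z_0 = 1/30, Z_1 = 1/90.
print(posterior_over_m([math.log(0.5)] * 2, [math.log(1 / 30), math.log(1 / 90)]))
```

The relative sizes of the $Z_m$ alone decide where the weight goes once the prior is fixed, which is why their exponential rates are the object of study in Sections 4-5.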



Caution. Note that $\pi_m$ and $\pi_{\mathbf{u}}$ have different meanings, as do $Z_m$ and
$Z_{\mathbf{u}}$, and $Q_{\mathbf{u}}$ and $Q_m$. The reader should have no difficulty discerning the
proper meaning by context or careful examination of the fonts.

2.3. Transformations of the covariates. We have assumed that under $P_f$
the covariates $X_n$ are uniformly distributed on [0, 1]; and, in constructing
the priors $\pi_m$ and $\pi^\nu$, we have used uniform mixtures on the locations of
the split points $u_i$. This is only for convenience: the covariate space could be
relabeled by any homeomorphism without changing the nature of the
estimation problem. Thus, if the data sample $(\mathbf{x}, \mathbf{y})$ were changed to $(G^{-1}\mathbf{x}, \mathbf{y})$,
where $G$ is a continuous, strictly increasing c.d.f., and if $G$-mixtures rather
than uniform mixtures were used in building the priors, then the predictive
probabilities would be unchanged:

(18)  $Z_{m,G}(G^{-1}\mathbf{x}, \mathbf{y}) = Z_m(\mathbf{x}, \mathbf{y}),$

where $Z_{m,G}$ are the predictive probabilities for the transformed data
relative to priors $\pi_G$ built using $G$-mixtures instead of uniform mixtures.
(This follows directly from the transformation formula for integrals.)

In a number of arguments below it will be necessary to consider data
streams $(\mathbf{X}, \mathbf{Y})$ distributed according to $P_{f,F}$, where $F$ is a distribution
other than the uniform distribution on [0, 1]. First, if $(\mathbf{x}, \mathbf{y})_n$ is a data
sample of size $n$, with covariates $x_i \in [0, 1]$, and if $G$ is the c.d.f. of the
uniform distribution on the interval $[0, n]$ (so that $G^{-1}$ is just multiplication
by $n$), then applying the transformation $G^{-1}$ has the effect of
standardizing the spacings between data points and between split points. This will be
used in Section 5. Second, if the data stream $\{(X_n, Y_n)\}_{n \ge 1}$ is subjected to
thinning (for instance, remove a data point $(X_n, Y_n)$ from the stream with
probability $p(X_n)$ depending on the value of the covariate $X_n$), then the
resulting thinned data stream will be distributed according to $P_{f,F}$, where $F$
has density proportional to $1 - p$. The comparison arguments in Section 4.6
below will rely on this fact.

2.4. Self-similarity. The key to analyzing the predictive probability (15)
is that the integral in (15) has an approximately self-similar structure: it
almost (but not exactly!) factors as the product of two integrals each of
the same form, one over the data and split points in [0, 1/2], the other over
(1/2, 1]:

(19)  $Z_m(\mathbf{X}, \mathbf{Y}) \approx Z_{m/2,G_0}(\mathbf{X}', \mathbf{Y}') \, Z_{m/2,G_1}(\mathbf{X}'', \mathbf{Y}'').$

Here $G_0$, $G_1$ are the uniform distributions on [0, 0.5] and [0.5, 1], respectively,
and $(\mathbf{X}', \mathbf{Y}')$ and $(\mathbf{X}'', \mathbf{Y}'')$ are the subsets of the data set $(\mathbf{X}, \mathbf{Y})$ with
covariates in [0, 0.5] and [0.5, 1], respectively. Unfortunately, the factorization
is not exact, for two reasons: (i) the split point vectors $\mathbf{u}$ in the integral (15)
do not necessarily include a split point at 1/2; and (ii) the number of split
points in (0, 1/2) is not exactly $m/2$. Nevertheless, when $m$ is large, most
split point vectors $\mathbf{u}$ will include a split very near 1/2, and will put about
$m/2$ splits in each of (0, 1/2) and (1/2, 1), and so it is not unreasonable to
expect that (19) should hold approximately.

Consider the two factors $Z' = Z_{m/2,G_0}(\mathbf{X}', \mathbf{Y}')$ and $Z'' = Z_{m/2,G_1}(\mathbf{X}'', \mathbf{Y}'')$ in
the approximation (19). If the sample size $n$ is large, then each of the
subsamples $(\mathbf{X}', \mathbf{Y}')$ and $(\mathbf{X}'', \mathbf{Y}'')$ should contain about $n/2$ points. Furthermore,
if the true regression function $f \equiv p$ is constant, then, under $P_p$, each of the
subsamples should, after covariate transformation by $x \mapsto 2x \bmod 1$, be
distributed as a sample of size (about) $n/2$ under $P_p$. Therefore, by (18), the
factors in (19) are under $P_p$ independent random variables, each with the same
distribution as $Z_{m/2}((\mathbf{X}, \mathbf{Y})_{n/2})$ under $P_p$.

Iteration of this factorization exhibits the predictive probability $Z_m(\mathbf{X}, \mathbf{Y})$
approximately as a product of a large number of independent, identically
distributed factors. (Note, though, that the errors in these approximations may
accumulate exponentially in the number of iterations; this will be handled by
the use of subadditivity arguments in Section 4.) In essence, the exponential
decay (4) follows from this approximate product representation.

2.5. Beta function and Beta distributions. Because the posterior distri- 
butions (11) of the step height random variables are Beta distributions, 
certain elementary properties of these distributions and the corresponding 
normalizing constants B(n,m) will play an important role in the analysis. 
The behavior of the Beta function for large arguments is well understood, 
and easily deduced from Stirling's formula. Similarly, the asymptotic be- 
havior of the Beta distributions follows from the fact that these are the 
distributions of uniform order statistics: 

Beta concentration property. For each $\varepsilon > 0$, there exists $k(\varepsilon) < \infty$
such that, for all index pairs $(m, n)$ with weight $m + n > k(\varepsilon)$, (a) the
Beta-$(m, n)$ distribution puts all but at most $\varepsilon$ of its mass within $\varepsilon$ of
$m/(m + n)$; and (b) the normalization constant $B(m, n)$ satisfies

(20)  $\left| \frac{\log B(m, n)}{m + n} + H\left( \frac{m}{m + n} \right) \right| < \varepsilon,$

where $H(x)$ is the Shannon entropy, defined by

(21)  $H(x) = -x \log x - (1 - x) \log(1 - x).$

Note that the binomial coefficient in (13) is bounded above by $2^{m+n}$, so
it follows that $B(m, n) \ge 4^{-m-n}$. Thus, by (15), for any data sample $(\mathbf{x}, \mathbf{y})$
of size $n$,

(22)  $Z_m(\mathbf{x}, \mathbf{y}) \ge 4^{-n}.$
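
Part (b) of the Beta concentration property and the bound $B(m, n) \ge 4^{-m-n}$ are easy to probe numerically. The sketch below is illustrative only: the helper names are hypothetical, and $\log B(m, n)$ is computed through the identity $B(m, n) = m! \, n! / (m + n + 1)!$, which is equivalent to (13).

```python
import math


def shannon_entropy(x):
    """H(x) from (21), with the usual convention 0 * log 0 = 0."""
    return -sum(t * math.log(t) for t in (x, 1.0 - x) if t > 0.0)


def log_B(m, n):
    """log B(m, n) under convention (13), via B = m! n! / (m + n + 1)!."""
    return math.lgamma(m + 1) + math.lgamma(n + 1) - math.lgamma(m + n + 2)


def entropy_gap(m, n):
    """Left side of (20): |log B(m, n) / (m + n) + H(m / (m + n))|."""
    return abs(log_B(m, n) / (m + n) + shannon_entropy(m / (m + n)))


# The gap shrinks as the weight m + n grows with m / (m + n) held at 0.3.
for total in (10, 100, 1000, 10000):
    m = (3 * total) // 10
    print(total, round(entropy_gap(m, total - m), 5))
```

By Stirling's formula the gap is of order $\log(m + n)/(m + n)$, which is consistent with the decreasing values the loop prints.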

Some of the arguments in Section 4 will require an estimate of the effect
on the integral (15) of adding another split point. This breaks one of the
intervals $J_i$ into two, leaving all of the others unchanged, and so the effect
on the integrand in (15) is that one of the factors $B(N_i^S, N_i^F)$ is replaced
by a product of two factors $B(N_L^S, N_L^F) \, B(N_R^S, N_R^F)$, where the cell counts
satisfy

$N_L^S + N_R^S = N_i^S$

and

$N_L^F + N_R^F = N_i^F.$

The following inequality, which is easily deduced from (13), shows that the
multiplicative error made in this replacement is bounded by the overall
sample size:

(23)  $\frac{B(N_i^S, N_i^F)}{B(N_L^S, N_L^F) \, B(N_R^S, N_R^F)} \le N^S + N^F.$
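
Inequality (23) can be checked exhaustively for small counts using exact rational arithmetic. The sketch below is a sanity check under stated assumptions, not a proof: it bounds the ratio by the count $S + F$ of the cell being split, which is at most the overall sample size $N^S + N^F$, and it skips empty cells, where the ratio equals 1. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def B(m, n):
    """B(m, n) under convention (13), as an exact rational number."""
    return Fraction(1, (m + n + 1) * comb(m + n, m))


def worst_split_ratio(max_count):
    """Max over small counts of B(S, F) / (B(SL, FL) B(S - SL, F - FL)) / (S + F)."""
    worst = Fraction(0)
    for s in range(max_count + 1):
        for f in range(max_count + 1):
            if s + f == 0:
                continue  # empty cell: both sides of the split are empty too
            for sl in range(s + 1):
                for fl in range(f + 1):
                    ratio = B(s, f) / (B(sl, fl) * B(s - sl, f - fl))
                    worst = max(worst, ratio / (s + f))
    return worst


print(worst_split_ratio(8) <= 1)
```

The underlying reason is that the product of the two binomial coefficients for the halves never exceeds the binomial coefficient for the whole cell, so only the factors $m + n + 1$ in (13) contribute to the bound.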



2.6. The entropy functional. We will show, in Section 3 below, that in
the "Middle Zone," where the number of split points is large but small
compared to the number $n$ of data points, the predictive probability decays
at a precise exponential rate as $n \to \infty$. The rate is the negative of the
entropy functional $H(f)$, defined by

(24)  $H(f) = \int_0^1 H(f(x)) \, dx,$

where $H(x)$ for $x \in (0, 1)$ is the Shannon entropy defined by (21) above. The
Shannon entropy function $H(x)$ is uniformly continuous and strictly concave
on [0, 1], with second derivative bounded away from 0; it is strictly positive
except at the endpoints 0 and 1; and it attains a maximum value of $\log 2$
at $x = 1/2$. The entropy functional $H(f)$ enjoys similar properties:

Entropy continuity property. For each $\varepsilon > 0$, there exists $\delta > 0$ so
that

(25)  $\|f - g\|_1 < \delta \implies |H(f) - H(g)| < \varepsilon.$

Entropy concavity property. Let $f$ and $g$ be binary regression
functions such that $g$ is an averaged version of $f$ in the following sense:
There exist finitely many pairwise disjoint Borel sets $B_i$ such that
$\{x : g(x) \ne f(x)\} = \bigcup_i B_i$, and for each $i$ such that $|B_i| > 0$,

(26)  $g(x) = \int_{B_i} f(y) \, dy \, / \, |B_i| \quad \forall x \in B_i.$

Then

(27)  $H(g) - H(f) \ge -\left( \max_p H''(p) \right) \|f - g\|_1 / 2.$

Hence, $H(g) \ge H(f)$ with strict inequality unless $f = g$ a.e.

Proof. The Continuity property (25) follows from the uniform
continuity of the Shannon function $H(p)$ by an elementary argument. The Concavity
property follows from the Jensen inequality, and the "uniform"
strengthening (27) from the fact that $H''(p)$ is bounded away from zero on [0, 1]. To
see this, let $B = \{x : f(x) \ne g(x)\}$, and let $-C < 0$ be an upper bound for
$H''(p)$. The hypothesis (26) implies that $g$ is constant on $B_i$, and equal to
the average $a_i$ of $f$ on this set. By Taylor's theorem, on $B_i$,

$H(f) - H(g) = \int_B H''(\zeta(x)) (f(x) - g(x))^2 \, dx / 2$
$\le -C \|f - g\|_2^2 / 2$
$\le -C \|f - g\|_1 / 2,$

the last inequality following because $0 \le f, g \le 1$. □

The Continuity and Concavity properties will be used principally to
estimate the entropies of step functions $g$ near $f$. In particular, they allow
entropy comparisons between a binary regression function $f$ and step
functions obtained by averaging $f$ on the intervals of a partition. Let $\mathbf{u}$ be a
vector of split points, and let $J_i = J_i(\mathbf{u})$ be the intervals in the partition of
[0, 1] induced by $\mathbf{u}$. For each binary regression function $f$, define $f_{\mathbf{u}}$ to be the
step function whose value on each interval $J_i(\mathbf{u})$ is the mean value $\int_{J_i} f / |J_i|$
of $f$ on that interval. Then by the Concavity property,

(28)  $H(f_{\mathbf{u}}) \ge H(f)$

with strict inequality unless $f = f_{\mathbf{u}}$ a.e. Moreover, the difference is small if
and only if $f$ and $f_{\mathbf{u}}$ are close in $L^1$. This will be the case if all intervals $J_i$
of the partition are small:

Lemma 1. For each binary regression function $f$ and each $\varepsilon > 0$, there
exists $\delta > 0$ such that if $|J_i| < \delta$ for every interval $J_i$ in the partition induced
by $\mathbf{u}$, then

(29)  $\|f - f_{\mathbf{u}}\|_1 < \varepsilon.$

Proof. First, observe that the assertion is elementary for continuous
regression functions, since continuity implies uniform continuity on [0, 1].
Second, recall that continuous functions are dense in $L^1[0, 1]$ by Lusin's
theorem; thus, for each regression function $f$ and any $\eta > 0$, there exists a
continuous function $g : [0, 1] \to [0, 1]$ such that $\|f - g\|_1 < \eta$. It then follows by
the elementary inequality $|\int h| \le \int |h|$ that, for any vector $\mathbf{u}$ of split points,

$\|f_{\mathbf{u}} - g_{\mathbf{u}}\|_1 < \eta.$

Finally, use $\eta = \varepsilon/3$ and choose $\delta$ so that, for the continuous function $g$ and
any $\mathbf{u}$ that induces a partition whose intervals are all of length $< \delta$,

$\|g - g_{\mathbf{u}}\|_1 < \eta.$

Then by the triangle inequality for $L^1$,

$\|f - f_{\mathbf{u}}\|_1 \le \|f - g\|_1 + \|g - g_{\mathbf{u}}\|_1 + \|f_{\mathbf{u}} - g_{\mathbf{u}}\|_1 < 3\eta = \varepsilon.$ □
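
The comparison (28) and the conclusion of Lemma 1 can be illustrated numerically. The sketch below approximates $H(f)$, $H(f_{\mathbf{u}})$ and $\|f - f_{\mathbf{u}}\|_1$ by the midpoint rule for a hypothetical regression function and equally spaced split points; the function names, the test function and the grid size are all assumptions made for the example.

```python
import math


def shannon_entropy(p):
    """H(p) from (21), with the convention 0 * log 0 = 0."""
    return -sum(t * math.log(t) for t in (p, 1.0 - p) if t > 0.0)


def entropy_and_l1_gap(f, cuts, grid=2000):
    """Return (H(f), H(f_u), ||f - f_u||_1), each by the midpoint rule."""
    edges = [0.0] + sorted(cuts) + [1.0]
    h_f = h_fu = l1 = 0.0
    for a, b in zip(edges, edges[1:]):
        width = b - a
        values = [f(a + width * (k + 0.5) / grid) for k in range(grid)]
        mean = sum(values) / grid  # value of f_u on this cell
        h_f += sum(shannon_entropy(v) for v in values) * width / grid
        l1 += sum(abs(v - mean) for v in values) * width / grid
        h_fu += shannon_entropy(mean) * width
    return h_f, h_fu, l1


f = lambda x: 0.2 + 0.6 * x  # hypothetical regression function
for m in (1, 4, 16):
    cuts = [(i + 1) / (m + 1) for i in range(m)]
    print(m, entropy_and_l1_gap(f, cuts))
```

Because the cell mean and the entropy integral are computed from the same grid values, the discrete Jensen inequality guarantees that the computed $H(f_{\mathbf{u}})$ is no smaller than the computed $H(f)$, up to rounding.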

2.7. Empirical distributions under Pf. Theorem 1 will be proved by 
showing (A) that, for large n, the posterior mass of the step functions with 
more than en discontinuities is of smaller order of magnitude than that of 
the step functions with fewer than en discontinuities; and (B) that the pos- 
terior mass on the latter set concentrates on those step functions that are 
L 1 -close to the true regression function. Step (B) will rely on a uniform 
version of the law of large numbers (LLN). 

Following is a suitable version of the LLN. Given a data set (X, Y)„ of 
size n and an interval J, say that J is e-bad (relative to the data set) if at 
least one of the following inequalities holds: 

(30) \N(J)-n\J\\>en\J\, 



(31) 



N s (J)-n / f(x)dx 



J 



>en\J\ 



or 

' tFi 



(32) 



N*(J)-n / (l-f(x))dx 



.1 



>en\J\. 



Here \J\ denotes the Lebesgue measure of J. Given x £ [0, 1], say that x is 
(e, ft)-bad (relative to the data set) if there is an e-bad interval J of length 
| J\ > n/n that contains x. Let B n (e, k) be the set of (e, K)-bad points relative 
to (X,Y) n . 

Proposition 2. For any $\varepsilon > 0$, there exist positive constants $\kappa, \gamma, C$ such that, for every sample size $n \ge 1$,

(33) $P_f\{|B_n(\varepsilon,\kappa)| > \varepsilon\} < Ce^{-\gamma n}.$

The exponential estimates, in conjunction with the Borel-Cantelli lemma, yield as a consequence a uniform strong law of large numbers. For the proof of Theorem 1, only a weak law is needed; however, it is no more difficult to establish the exponential bounds (33). The proof of Proposition 2 is deferred to Appendix B.
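As an illustration of the definition (not part of the original argument), the following minimal Python sketch simulates a data set and tests one interval against inequalities (30)-(32). The regression function `f`, the sample size, the interval and all helper names are hypothetical choices made for this example; covariates are assumed to be uniform on [0, 1].

```python
import random


def f(x):
    """Hypothetical regression function used only for this illustration."""
    return 0.2 + 0.6 * x


def simulate_data(n, reg, rng):
    """Draw n uniform covariates and Bernoulli(reg(x)) responses."""
    xs = [rng.random() for _ in range(n)]
    ys = [1 if rng.random() < reg(x) else 0 for x in xs]
    return xs, ys


def integral(reg, a, b, steps=2000):
    """Midpoint-rule approximation of the integral of reg over [a, b]."""
    h = (b - a) / steps
    return h * sum(reg(a + (i + 0.5) * h) for i in range(steps))


def is_eps_bad(xs, ys, reg, a, b, eps):
    """True if J = (a, b] violates at least one of (30)-(32) for this data set."""
    n = len(xs)
    length = b - a
    inside = [y for x, y in zip(xs, ys) if a < x <= b]
    n_all = len(inside)             # N(J)
    n_succ = sum(inside)            # N^S(J)
    n_fail = n_all - n_succ         # N^F(J)
    mass_s = integral(reg, a, b)    # integral of f over J
    mass_f = length - mass_s        # integral of 1 - f over J
    tol = eps * n * length
    return (abs(n_all - n * length) > tol
            or abs(n_succ - n * mass_s) > tol
            or abs(n_fail - n * mass_f) > tol)


rng = random.Random(0)
xs, ys = simulate_data(20000, f, rng)
print(is_eps_bad(xs, ys, f, 0.25, 0.75, eps=0.1))
```

Replacing the responses by a constant sequence (all successes, say) makes the same interval bad through (31), which is a quick way to see what the definition is detecting.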
3. Beginning and middle zones. Following [5], we designate three asymptotic "zones" where the predictive probabilities $Z_m((X,Y)_n)$ decay at different exponential rates. These are determined by the relative sizes of $m$, the number of discontinuities of the step functions, and $n$, the sample size. The end zone is the set of pairs $(m,n)$ such that $m/n > \varepsilon$; this zone will be analyzed in Sections 4 and 5, where we shall prove that the asymptotic decay of $Z_m((X,Y)_n)$ is faster than in the middle zone, where $K \le m \le \varepsilon n$ for a large constant $K$. The beginning zone is the set of pairs $(m,n)$ for which $m < K$ for some large $K$. A regression function cannot be arbitrarily well approximated by step functions with a bounded number of discontinuities unless it is itself a step function, and so, as we will see, the asymptotic decay of $Z_m((X,Y)_n)$ is generally faster in the beginning zone than in the middle zone.

In this section we analyze the beginning and middle zones, using the Beta concentration property, Lemma 1 and Proposition 2. In the beginning and middle zones, the number $m$ of split points is small compared to the number $n$ of data points, and so for typical split-point vectors $u$, most intervals in the partition induced by $u$ will, with high probability, contain a large number of data points. Consequently, the law of large numbers applies in these intervals: together with the Beta concentration property, it ensures that the $Q_u$-posterior is concentrated in an $L^1$-neighborhood of $f_u$, and that the $Q_u$-predictive probability is roughly $\exp\{-nH(f_u)\}$. The next proposition makes this precise.

Proposition 3. For each $\delta > 0$, there exists $\varepsilon > 0$ such that the following is true: For all sufficiently large $n$, the $P_f$-probability is at least $1 - \delta$ that, for all $m \le \varepsilon n$ and all split-point vectors $u$ of size $m$,

(34) $Q_u(\{g : \|g - f_u\|_1 > \delta\} \mid (X,Y)_n) < \delta$

and

(35) $|n^{-1}\log Z_u((X,Y)_n) + H(f_u)| < \delta.$

Proof. Let $J_i = J_i(u)$ be the intervals in the partition induced by $u$. Fix $\kappa = \kappa(\delta)$ as in Proposition 2. If $\varepsilon$ is sufficiently small, then for any split-point vector $u$ of size $m \le \varepsilon n$, the union of those $J_i$ of length $< \kappa/n$ will have Lebesgue measure $< \delta$: this follows by a trivial counting argument, since there are at most $\varepsilon n + 1$ intervals and so the short ones have total length less than $(\varepsilon n + 1)\kappa/n$. Let $B_u$ be the union of those $J_i$ that are either of length $< \kappa/n$ or are $\delta$-bad [in the sense of inequalities (30)-(32)]. By Proposition 2, the $P_f$-probability of the event $G^c$ that there exists a split-point vector $u$ of size $m \le \varepsilon n$ for which the Lebesgue measure of $B_u$ exceeds $2\delta$ is less than $\varepsilon$, for all large $n$. But on the complementary event $G$, inequality (34) must hold (with possibly different values of $\delta$) by the Beta concentration property.
For the proof of (35), recall that by (12),

(36) $n^{-1}\log Z_u((X,Y)_n) = n^{-1}\sum_{i=0}^{m}\log B(N_i^S, N_i^F),$

where $B(k,l)$ is the Beta function (using our convention for the arguments). By the Stirling approximation (20), each term of the sum for which $N_i$ is large is well approximated by $-N_iH(N_i^S/N_i)$; and for each index $i$ such that $J_i \not\subset B_u$, this in turn is well approximated by $-n|J_i|H(f_u(J_i))$, where $f_u(J_i)$ is the average of $f$ on the interval $J_i$. If $B_u$ were empty, then (35) would follow directly.

By Proposition 2, $P_f(G^c) < \varepsilon$ for all sufficiently large $n$. On the complementary event $G$, the Lebesgue measure of the set $B_u$ of "bad" intervals $J_i$ is $< 2\delta$. Because the intervals $J_i$ not contained in $B_u$ must have approximately the expected frequency $n|J_i|$ of data points, by (30), the number of data points in $B_u$ cannot exceed $4\delta n$, on the event $G$. Since $1 \ge B(k,l) \ge 4^{-k-l}$, it follows that the summands in (36) for which $J_i \subset B_u$ cannot contribute more than $4\delta\log 4$ to the right-hand side. Assertion (35) now follows (with a larger value of $\delta$). □
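The approximation in (35) can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the Beta convention to be $B(k,l) = \int_0^1 \theta^k(1-\theta)^l\,d\theta = k!\,l!/(k+l+1)!$ (the paper fixes its convention in an earlier section, so this is an assumption here), measures entropy in nats, and uses a hypothetical step regression function whose jumps sit on the chosen split points, so that $f_u = f$. All function names are invented for the example.

```python
import math
import random
from bisect import bisect_right


def log_beta(k, l):
    """log B(k, l) with B(k, l) = k! l! / (k + l + 1)!  (assumed convention)."""
    return math.lgamma(k + 1) + math.lgamma(l + 1) - math.lgamma(k + l + 2)


def entropy(p):
    """Binary entropy in nats, with the value 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def log_predictive(xs, ys, splits):
    """log Z_u: sum over cells of log B(N_i^S, N_i^F), as in (36) before dividing by n."""
    splits = sorted(splits)
    succ = [0] * (len(splits) + 1)
    fail = [0] * (len(splits) + 1)
    for x, y in zip(xs, ys):
        cell = bisect_right(splits, x)
        if y == 1:
            succ[cell] += 1
        else:
            fail[cell] += 1
    return sum(log_beta(s, t) for s, t in zip(succ, fail))


def f(x):
    """Hypothetical step regression function: 0.2 on [0, 0.5), 0.7 on [0.5, 1]."""
    return 0.2 if x < 0.5 else 0.7


rng = random.Random(1)
n = 20000
xs = [rng.random() for _ in range(n)]
ys = [1 if rng.random() < f(x) else 0 for x in xs]
splits = [0.25, 0.5, 0.75]   # f is constant on every cell, so f_u = f
H_fu = 0.5 * entropy(0.2) + 0.5 * entropy(0.7)
rate = log_predictive(xs, ys, splits) / n
print(rate, -H_fu)
```

The two printed numbers should be close for large `n`; the gap reflects both sampling noise and the lower-order Stirling terms.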

Corollary 4. For each $\delta > 0$, there exist $\varepsilon > 0$ and $K < \infty$ such that the following is true: If $K \le m \le \varepsilon n$ and $n$ is sufficiently large, then with $P_f$-probability at least $1 - \delta$,

(37) $Q_m(\{g : \|g - f\|_1 > \delta\} \mid (X,Y)_n) < \delta$

and

(38) $|n^{-1}\log Z_m((X,Y)_n) + H(f)| < \delta.$

Proof. For large $m$ (say, $m \ge K$), most split-point vectors $u$ (as measured by the uniform distribution on $[0,1]^m$) are such that all intervals $J_i(u)$ in the induced partition are short (this follows, for instance, from the Glivenko-Cantelli theorem), and so, by Lemma 1, $\|f - f_u\|_1$ is small. Thus, for any $\alpha, \varepsilon > 0$, there exists $K < \infty$ such that if $m \ge K$, then the set

$B_m(\alpha) := \{u \in [0,1]^m : \|f - f_u\|_1 > \alpha\}$

has Lebesgue measure $< \varepsilon$. Inequality (34) of Proposition 3 implies that, for each $u$ in the complementary event $B_m^c(\alpha)$, the $Q_u$-posterior distribution is concentrated on a small $L^1$-neighborhood of $f$, provided $\alpha$ is small. Thus, to prove (37), it must be shown that the contribution to the $Q_m$-posterior (14) from split-point vectors $u \in B_m(\alpha)$ is negligible. For this, it suffices to show that the predictive probabilities $Z_u((X,Y)_n)$ are not larger for $u \in B_m(\alpha)$ than for $u \in B_m^c(\alpha)$.
By the entropy concavity property, $H(f_u) \ge H(f)$ for all split-point vectors $u$, and for $u \in B_m(\alpha)$,

$H(f_u) > H(f) + C\alpha,$

where $2C = \max H''(p)$ [see (27)]. On the other hand, by the entropy continuity property, if $\rho > 0$ is sufficiently small, then for all $u \notin B_m(\rho)$,

$|H(f) - H(f_u)| < C\alpha/2.$

By the second assertion (35) of Proposition 3, for all sufficiently large $n$ there is $P_f$-probability at least $1 - \delta$ that $n^{-1}\log Z_u((X,Y)_n)$ is within $\delta$ of $-H(f_u)$ for all $u$. Consequently, the primary contribution to the integral (14) must come from $u \notin B_m(\alpha)$. This proves (37). Assertion (38) also follows, in view of the representation (15) for the predictive probability $Z_m((X,Y)_n)$. □
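The two facts driving this proof, that cell-averaging can only raise the entropy and that finer partitions bring $f_u$ closer to $f$ in $L^1$, can be seen on a toy example. The sketch below assumes that $H(f) = \int_0^1 h(f(x))\,dx$ with $h$ the binary entropy in nats (the paper's definition sits in an earlier section), takes the hypothetical regression function $f(x) = x$ with equally spaced split points, and approximates all integrals on a midpoint grid. The helper names are invented for this illustration.

```python
import math
from bisect import bisect_right


def entropy(p):
    """Binary entropy in nats, with the value 0 at p = 0 and p = 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)


def step_projection(reg, splits, grid=20000):
    """Grid midpoints and the cell-average function f_u evaluated on them."""
    splits = sorted(splits)
    mids = [(i + 0.5) / grid for i in range(grid)]
    cells = [bisect_right(splits, x) for x in mids]
    sums = [0.0] * (len(splits) + 1)
    counts = [0] * (len(splits) + 1)
    for x, c in zip(mids, cells):
        sums[c] += reg(x)
        counts[c] += 1
    return mids, [sums[c] / counts[c] for c in cells]


def l1_and_entropies(reg, splits, grid=20000):
    """Approximate ||f - f_u||_1, H(f) and H(f_u) on the midpoint grid."""
    mids, fu = step_projection(reg, splits, grid)
    l1 = sum(abs(reg(x) - v) for x, v in zip(mids, fu)) / grid
    h_f = sum(entropy(reg(x)) for x in mids) / grid
    h_fu = sum(entropy(v) for v in fu) / grid
    return l1, h_f, h_fu


def f(x):
    """Hypothetical regression function for the illustration."""
    return x


for m in (0, 1, 3, 7, 15):
    splits = [(j + 1) / (m + 1) for j in range(m)]
    l1, h_f, h_fu = l1_and_entropies(f, splits)
    print(m, round(l1, 4), round(h_fu - h_f, 4))
```

In this example the entropy gap is nonnegative for every `m`, as Jensen's inequality requires for a concave $h$, and both the gap and the $L^1$ distance shrink as the partition is refined.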

The exponential decay rate of the predictive probabilities in the beginning zone depends on whether or not the true regression function $f$ is a step function. If not, the decay is faster than in the middle zone; if so, the decay matches that in the middle zone, but the posterior concentrates in a neighborhood of $f$.

Corollary 5. If the regression function $f$ is a step function with $k$ discontinuities in $(0,1)$, then for each $m \ge k$ and all $\varepsilon > 0$, inequalities (37) and (38) hold with $P_f$-probability tending to 1 as the sample size $n \to \infty$. If $f$ is not a step function with fewer than $K + 1$ discontinuities, then there exists $\varepsilon > 0$ such that, with $P_f$-probability $\to 1$ as $n \to \infty$,

(39) $\max_{m \le K} Z_m((X,Y)_n) < \exp\{-nH(f) - n\varepsilon\}.$

Proof. If $f$ is not a step function with fewer than $K + 1$ discontinuities, then by the entropy concavity property there exists $\varepsilon > 0$ so that $H(f_u)$ is bounded below by $H(f) + \varepsilon$ for all split-point vectors $u$ of length $m \le K$. Hence, (39) follows from (35), by the same argument as in the proof of Corollary 4.

Suppose then that $f$ is a step function with $k$ discontinuities, that is, $f = f_{u^*}$ for some split-point vector $u^*$ of length $k$. For any other split-point vector $u$, the entropy $H(f_u)$ cannot be smaller than $H(f)$ by the entropy concavity property, and so (35) implies that, for any $m$, the exponential decay rate of the predictive probability $Z_m((X,Y)_n)$ as $n \to \infty$ cannot exceed $-H(f)$. But since $f$ is a step function with $k$ discontinuities, any open $L^1$ neighborhood of $f$ has positive $\pi_m$-probability; consequently, by entropy continuity and (35), the exponential decay rate of $Z_m((X,Y)_n)$ in $n$ must be at least $-H(f)$. Thus, (38) holds with $P_f$-probability $\to 1$ as $n \to \infty$. Finally, (37) follows by the same argument as in the proof of Corollary 4. □

Corollaries 5 and 4 imply that, with $P_f$-probability $\to 1$ as $n \to \infty$, the $Q^{\nu}$-posterior in the beginning and middle zones concentrates near $f$, and that the total posterior mass in the beginning and middle zones decays at the exponential rate $H(f)$ as $n \to \infty$. Thus, to complete the proof of Theorem 1, it suffices to show that the posterior mass in the end zone $m \ge \delta n$ decays at an exponential rate $> H(f)$. This will be the agenda for the remainder of the article: see Proposition 6 below.

4. The end zone. For the Diaconis-Freedman priors, the log-predictive 
probabilities simplify neatly as sums of independent random variables, and 
so their asymptotic behavior drops out easily from the usual WLLN. No 
such simplification is possible in our case: the integral in (15) does not admit 
further reduction. Thus, the analysis of the posterior in the end zone will 
necessarily be somewhat more roundabout than in the Diaconis-Freedman 
case. The main objective is the following. 

Proposition 6. For any Borel measurable regression function $f \not\equiv 1/2$ and all $\varepsilon > 0$, there exist constants $\delta = \delta(\varepsilon, f) > 0$ such that

(40) $\lim_{n\to\infty} P_f\bigl\{\sup_{m \ge \varepsilon n}\log Z_m((X,Y)_n) > n(-H(f) - \delta)\bigr\} = 0.$

Given this result, the consistency theorem follows.

Proof of Theorem 1. Proposition 6 implies that, for any $\varepsilon > 0$, with high $P_f$-probability the posterior mass (un-normalized) in the region $m \ge \varepsilon n$ is less than $\exp\{-nH(f) - n\delta\}$. Corollaries 4 and 5 imply that there exists $\varepsilon' > 0$ so that, with $P_f$-probability tending to one as $n \to \infty$, the posterior mass in the region $m < \varepsilon' n$ is at least $\exp\{-nH(f) - n\delta/2\}$. Consequently, for any $\varepsilon > 0$, the posterior mass will, for large $n$, with high $P_f$-probability be almost entirely concentrated on step functions with $m \le \varepsilon n$ discontinuities.

Assertion (37) of Corollary 4, together with Corollary 5, implies that for some $\varepsilon'' > 0$, most of the posterior mass in the region $m < \varepsilon'' n$ will, with high $P_f$-probability, be concentrated on step functions near $f$. Since by the preceding paragraph nearly all of the posterior mass will eventually be concentrated in the region $m < \varepsilon'' n$, the result (1) follows. □

To prove Proposition 6, we will show in Proposition 11 below that the predictive probabilities (after suitable "Poissonization") decay exponentially in $n$ at a precise rate, depending on $\alpha > 0$, for $m/n \to \alpha > 0$.
4.1. Preliminaries: Comparison and Poissonization. Comparison arguments will be based on the following simple observation.

Lemma 7. Adding more data points $(x_i, y_i)$ to the sample $(x, y)$ decreases the value of $Z_m(x, y)$.

Proof. For each fixed pair $u, w \in [0,1]^m$, adding data points to the sample increases at least some of the cell counts $N_i^S, N_i^F$, and therefore decreases the integrand in (12). □
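For the Beta-function factors that appear in (36), the monotonicity can be made explicit. The short computation below assumes the convention $B(k,l) = \int_0^1 \theta^k(1-\theta)^l\,d\theta = k!\,l!/(k+l+1)!$, which is an assumption of this note since the paper's convention is set in an earlier section.

```latex
% Effect of one additional success or failure in a cell with counts (k, l),
% under the assumed convention B(k,l) = k! l! / (k+l+1)!.
\[
  \frac{B(k+1,l)}{B(k,l)} = \frac{k+1}{k+l+2} < 1,
  \qquad
  \frac{B(k,l+1)}{B(k,l)} = \frac{l+1}{k+l+2} < 1 .
\]
% Each added data point therefore multiplies exactly one factor of the
% product over cells by a number strictly less than 1.
```

Averaging such products over split points preserves the inequality, which is the sense in which more data can only lower the predictive probability.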

Two Poissonizations will be used, one for the data sample, the other for the sample of split points. Let $\Lambda(t)$ be a standard Poisson counting process of intensity 1, independent of the data stream $(X, Y)$. Replacing the sample $(X,Y)_n$ of fixed size $n$ by a sample $(X,Y)_{\Lambda(n)}$ of size $\Lambda(n)$ has the effect of making the success/failure counts in disjoint intervals independent random variables with Poisson distributions.

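Lemma 8. For each $\varepsilon > 0$, the probability that

(41) $Z_m((X,Y)_{\Lambda(n+\varepsilon n)}) \le Z_m((X,Y)_n) \le Z_m((X,Y)_{\Lambda(n-\varepsilon n)})$

for all $m$ approaches 1 as $n \to \infty$.

Proof. For any $\varepsilon > 0$, $P\{\Lambda(n - \varepsilon n) \le n \le \Lambda((1 + \varepsilon)n)\} \to 1$ as $n \to \infty$, by the weak law of large numbers. On this event, the sample of size $\Lambda(n - \varepsilon n)$ is contained in the sample of size $n$, which is contained in the sample of size $\Lambda(n + \varepsilon n)$, and so inequality (41) must hold by Lemma 7. □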
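The event used in this proof is easy to simulate. The following sketch (an illustration only, with hypothetical parameter values and helper names) builds a rate-1 Poisson counting process from exponential interarrival times and estimates how often $\Lambda(n - \varepsilon n) \le n \le \Lambda(n + \varepsilon n)$ holds.

```python
import random


def poisson_counts(times, rng):
    """Values Lambda(t) of one rate-1 Poisson counting process at the given times."""
    counts = []
    arrivals = 0
    clock = rng.expovariate(1.0)      # time of the next arrival
    for t in sorted(times):
        while clock <= t:
            arrivals += 1
            clock += rng.expovariate(1.0)
        counts.append(arrivals)
    return counts


def sandwich_holds(n, eps, rng):
    """True if Lambda((1 - eps) n) <= n <= Lambda((1 + eps) n) on this sample path."""
    low, high = poisson_counts([(1 - eps) * n, (1 + eps) * n], rng)
    return low <= n <= high


rng = random.Random(2)
trials = 100
hits = sum(sandwich_holds(5000, 0.1, rng) for _ in range(trials))
print(hits / trials)
```

Both counts are read off the same sample path, which mirrors the coupling in the lemma: the three data sets in (41) are nested prefixes of a single data stream.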

The second Poissonization, for the split point vector, involves mixing the priors $\pi_m$ according to a Poisson hyperprior. For any $\lambda > 0$, let $\pi^*_\lambda$ be the Poisson-$\lambda$ mixture of the priors $\pi_m$, and let $Q^*_\lambda$ be the corresponding induced measure on data sequences (equivalently, $Q^*_\lambda$ is the Poisson-$\lambda$ mixture of the measures $Q_m$). Then the $Q^*_\lambda$-predictive probability for a data set $(x,y)$ is given by

(42) $Z^*_\lambda(x,y) := \sum_{k=0}^{\infty}\frac{\lambda^k e^{-\lambda}}{k!}\,Z_k(x,y).$

The effect of Poissonization on the number of split points is a bit more subtle than the effect on data, because there is no simple a priori relation between neighboring predictive probabilities $Z_m(x,y)$ and $Z_{m+1}(x,y)$. However, because the Poisson distribution with mean $\alpha n$ assigns mass at least $C/\sqrt{n}$ to the value $[\alpha n]$ (by the Local CLT), where $C = C(\alpha) > 0$ is continuous in $\alpha$, the following is obviously true.
Lemma 9. For each $\varepsilon > 0$, there exists $C < \infty$ such that, for all $\alpha \in [\varepsilon, \varepsilon^{-1}]$,

(43) $Z_{[\alpha n]}((X,Y)_{\Lambda(n)}) \le C\sqrt{n}\,Z^*_{\alpha n}((X,Y)_{\Lambda(n)}).$

Consequently, to prove Proposition 6 it suffices to prove that (40) holds when $Z_m((X,Y)_n)$ is replaced by $Z^*_{\alpha n}((X,Y)_n)$.
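The local CLT input to this lemma can be checked directly: the Poisson mass at $[\alpha n]$, multiplied by $\sqrt{n}$, stays bounded away from zero and approaches $1/\sqrt{2\pi\alpha}$. The sketch below is a numerical illustration with hypothetical function names; it evaluates the Poisson probability in log space to avoid overflow.

```python
import math


def poisson_log_pmf(k, mean):
    """log P{Poisson(mean) = k}."""
    return -mean + k * math.log(mean) - math.lgamma(k + 1)


def scaled_mass_at_floor(alpha, n):
    """sqrt(n) * P{Poisson(alpha * n) = floor(alpha * n)}."""
    k = math.floor(alpha * n)
    return math.sqrt(n) * math.exp(poisson_log_pmf(k, alpha * n))


for alpha in (0.5, 1.0, 2.0):
    limit = 1.0 / math.sqrt(2 * math.pi * alpha)
    values = [round(scaled_mass_at_floor(alpha, n), 4) for n in (10, 100, 1000)]
    print(alpha, values, round(limit, 4))
```

Since a single term of the sum (42) is at most the whole sum, a lower bound of order $1/\sqrt{n}$ on the Poisson weight of the index $[\alpha n]$ gives (43) after rearranging.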

Whereas it is difficult to compare neighboring predictive probabilities $Z_m(x,y)$ and $Z_{m+1}(x,y)$, it is quite easy to compare Poissonized predictive probabilities $Z^*_\mu(x,y)$ and $Z^*_\lambda(x,y)$ for neighboring intensities $\mu, \lambda$.

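Lemma 10. For each $\varepsilon > 0$ and $A < \infty$, there exists $\delta > 0$ such that if $\lambda \le A$ and $|\mu - \lambda| < \delta$, then for all $n \ge 1$ and all data sets $(x,y)$ of size $n$,

(44) $\dfrac{Z^*_{\mu n}(x,y)}{Z^*_{\lambda n}(x,y)} \le e^{n\varepsilon}.$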

Proof. Inequality (22) implies that $Z^*_\lambda(x,y) \ge 4^{-n}$. Chernoff's large deviation inequality implies that if $M$ has the Poisson distribution with mean $\lambda \le An$, then

$P\{M \ge \kappa n\} \le e^{-\gamma n},$

where $\gamma \to \infty$ as $\kappa \to \infty$. Since $Z_k(x,y) \le 1$, it follows that the contribution to the sum (42) from terms indexed by $k \ge \kappa n$ is of smaller exponential order of magnitude than that from terms indexed by $k < \kappa n$, provided $\gamma > \log 4$.

Consider the Poisson distributions with means $\mu n, \lambda n \le An$: these are mutually absolutely continuous, and the likelihood ratio at the integer value $k$ is

$(\mu/\lambda)^k e^{n\lambda - n\mu}.$

If $k \le \kappa n$ and $|\mu - \lambda|$ is sufficiently small, then this likelihood ratio is less than $e^{n\varepsilon}$. By the result of the preceding paragraph, only values of $k \le \kappa n$ contribute substantially to the expectations; thus, the assertion follows. □
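For completeness, the likelihood ratio quoted in the proof and the bound it leads to can be written out as follows. This is a routine computation added for the reader's convenience; the explicit sufficient condition in the last line is one convenient choice, not a statement taken from the paper.

```latex
% Ratio of the Poisson(mu n) and Poisson(lambda n) probabilities at the integer k.
\[
  \frac{e^{-\mu n}(\mu n)^k/k!}{e^{-\lambda n}(\lambda n)^k/k!}
  = \Bigl(\frac{\mu}{\lambda}\Bigr)^{k} e^{\,n\lambda - n\mu}.
\]
% For 0 <= k <= kappa n, taking logarithms gives
\[
  k\log\frac{\mu}{\lambda} + n(\lambda - \mu)
  \;\le\; n\Bigl(\kappa\,\Bigl|\log\frac{\mu}{\lambda}\Bigr| + |\lambda - \mu|\Bigr),
\]
% so the ratio is at most e^{n epsilon} as soon as
% kappa |log(mu/lambda)| + |lambda - mu| <= epsilon.
```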



In some of the arguments to follow, an alternative representation of these Poissonized predictive probabilities as a conditional expectation will be useful. Assume that on the underlying probability space $(\Omega, \mathcal{F}, P_f)$ are defined i.i.d. uniform-[0,1] r.v.s $U_n$ and independent Poisson processes $\Lambda, M$, all jointly independent of the data stream. Then

(45) $Z^*_\lambda((X,Y)_{\Lambda(n)}) = E_f(\beta \mid (X,Y)_{\Lambda(n)}),$

where

(46) $\beta = \beta(U_{M(\lambda)}; (X,Y)_{\Lambda(n)}) := \prod_{i=0}^{M(\lambda)} B(N_i^S, N_i^F)$

and $N_i^S, N_i^F$ are the success/failure cell counts for the data $(X,Y)_{\Lambda(n)}$ relative to the partition induced by the split point sample $U_{M(\lambda)}$.
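The representation (45)-(46) suggests a direct, if crude, Monte Carlo estimate of a doubly Poissonized predictive probability: draw a Poisson number of uniform split points, form the product of Beta factors, and average. The sketch below is a minimal illustration under assumptions: it uses the convention $B(k,l) = k!\,l!/(k+l+1)!$ (assumed, since the paper's convention is fixed earlier), a constant regression function $p$, small hypothetical parameter values, and invented function names. For large $n$ such an estimate can be dominated by rare split-point configurations, so it should not be read as an accurate evaluation of the limit in Proposition 11.

```python
import math
import random
from bisect import bisect_right


def log_beta(k, l):
    """log B(k, l) with B(k, l) = k! l! / (k + l + 1)!  (assumed convention)."""
    return math.lgamma(k + 1) + math.lgamma(l + 1) - math.lgamma(k + l + 2)


def poisson_sample(mean, rng):
    """Poisson variate: number of rate-1 arrivals before time `mean`."""
    count, clock = 0, rng.expovariate(1.0)
    while clock <= mean:
        count += 1
        clock += rng.expovariate(1.0)
    return count


def log_beta_product(xs, ys, splits):
    """log of the product (46) for one sample of split points."""
    splits = sorted(splits)
    succ = [0] * (len(splits) + 1)
    fail = [0] * (len(splits) + 1)
    for x, y in zip(xs, ys):
        cell = bisect_right(splits, x)
        if y == 1:
            succ[cell] += 1
        else:
            fail[cell] += 1
    return sum(log_beta(s, t) for s, t in zip(succ, fail))


def log_poissonized_predictive(xs, ys, lam, draws, rng):
    """Monte Carlo estimate of log E[beta | data], computed via log-sum-exp."""
    logs = []
    for _ in range(draws):
        m = poisson_sample(lam, rng)
        splits = [rng.random() for _ in range(m)]
        logs.append(log_beta_product(xs, ys, splits))
    top = max(logs)
    return top + math.log(sum(math.exp(v - top) for v in logs) / draws)


rng = random.Random(3)
n, alpha, p = 200, 2.0, 0.3
size = poisson_sample(n, rng)                      # Lambda(n)
xs = [rng.random() for _ in range(size)]
ys = [1 if rng.random() < p else 0 for _ in range(size)]
estimate = log_poissonized_predictive(xs, ys, alpha * n, draws=500, rng=rng)
print(estimate / n)
```

Because every factor lies between $4^{-N_i}$ and 1 under the assumed convention, the estimate necessarily falls between $-\Lambda(n)\log 4$ and 0, which gives a quick sanity check on any implementation.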

4.2. Exponential decay. The asymptotic behavior of the doubly Pois- 
sonized predictive probabilities is spelled out in the following proposition, 
whose proof will be the goal of Sections 4.4-4.6 and Section 5 below. 

Proposition 11. For each Borel measurable regression function $f$ and each $\alpha > 0$, there exists a constant $\psi_f(\alpha)$ such that, as $n \to \infty$,

(47) $n^{-1}\log Z^*_{\alpha n}((X,Y)_{\Lambda(n)}) \longrightarrow \psi_f(\alpha).$

The function $\psi_f(\alpha)$ satisfies

(48) $\psi_f(\alpha) = \int_0^1 \psi(f(x), \alpha)\,dx,$

where $\psi(p,\alpha) = \psi_p(\alpha)$ is the corresponding limit for the constant regression function $f \equiv p$. The function $\psi(p,\alpha)$ is jointly continuous in $p, \alpha$ and satisfies

(49) $\lim_{\alpha\to\infty}\max_{p \in [0,1]} |\psi_p(\alpha) + \log 2| = 0$

and

(50) $\psi_p(\alpha) < -H(p).$

Note that the entropy inequality (50) extends to all regression functions $f$: that is, $p$ may be replaced by $f$ on both sides of (50). This follows from the integral formulas that define $\psi_f(\alpha)$ and $H(f)$. The fact that this inequality is strict is crucially important to the consistency theorem. It will also require a rather elaborate argument: see Section 5 below.

The case $f \equiv p$, where the regression function is constant, will prove to be the crucial one. In this case, the existence of the limit (47) is somewhat reminiscent of the existence of "thermodynamic limits" in formal statistical mechanics (see [21], Chapter 3). Unfortunately, Proposition 11 cannot be reduced to the results of [21], for two reasons: (i) the data sequence enters conditionally (thus functioning as a "random environment"); and, more importantly, (ii) the hypothesis of "tempered interaction" needed in [21] cannot be verified here. The limit (47) is also related to the "conditional LDP" of Chi [2], but again cannot be deduced from the results of that paper, because the log-predictive probability cannot be expressed as a continuous functional of the empirical distribution of split point/data point pairs.
4.3. Proof of Proposition 6. Before proceeding with the somewhat arduous proof of Proposition 11, we show how to complete the proof of Proposition 6. In the process, we shall establish the asymptotic behavior (49) of the rate function.

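Lemma 12. For every $\delta > 0$, there exists $\alpha_\delta < \infty$ such that

(51) $\lim_{n\to\infty} P\bigl\{\sup_{\alpha \ge \alpha_\delta}\sup_y Z^*_{\alpha n}((X,y)_{\Lambda(n)}) > 2^{-n+n\delta}\bigr\} = 0$

and

(52) $\lim_{n\to\infty} P\bigl\{\inf_{\alpha \ge \alpha_\delta}\inf_y Z^*_{\alpha n}((X,y)_{\Lambda(n)}) < 2^{-n-n\delta}\bigr\} = 0.$

Here $\sup_y$ and $\inf_y$ are taken over all assignments of 0s and 1s to the response variables $y_1, y_2, \ldots, y_{\Lambda(n)}$. Similarly,

(53) $\lim_{n\to\infty} P\bigl\{\sup_{m \ge \alpha_\delta n}\sup_y Z_m((X,y)_n) > 2^{-n+n\delta}\bigr\} = 0$

and

(54) $\lim_{n\to\infty} P\bigl\{\inf_{m \ge \alpha_\delta n}\inf_y Z_m((X,y)_n) < 2^{-n-n\delta}\bigr\} = 0.$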

Given the convergence (47), the following is now immediate from (51)-(52).

Corollary 13. For every regression function $f$,

(55) $\lim_{\alpha\to\infty}\psi_f(\alpha) = -\log 2.$

Proof of Lemma 12. We shall prove only (51)-(52); the other two assertions may be proved by similar arguments. Let $\xi_1, \xi_2, \ldots, \xi_{\Lambda(n)+1}$ be the spacings between successive order statistics of the covariates $X_1, X_2, \ldots, X_{\Lambda(n)}$. For each pair of positive reals $\varepsilon, \delta > 0$, let $G = G_{\delta,\varepsilon}$ be the event that at least $(1 - \delta)n$ of the spacings $\xi_j$ are larger than $\varepsilon/n$. Call these spacings "fat." Since the spacings are independent exponentials with mean $1/n$, the Glivenko-Cantelli theorem implies that there exist $\delta = \delta(\varepsilon) \to 0$ as $\varepsilon \to 0$ such that

$\lim_{n\to\infty} P(G_{\delta,\varepsilon} \cap \{|\Lambda(n) - n| < \varepsilon n\}) = 1.$

By elementary large deviations estimates for the Poisson process, given $G$, the probability that a random sample of $M(\alpha n)$ split points is such that more than $(1 - 2\delta)n$ of the fat spacings contain no split points is less than $\exp\{-\gamma n\}$, where $\gamma = \gamma(\alpha, \varepsilon, \delta) \to \infty$ as $\alpha \to \infty$. But on the complement of this event, at least $(1 - 4\delta)n$ of the intervals induced by the split points have exactly one data point. Thus, on the event $G \cap \{|\Lambda(n) - n| < \varepsilon n\}$,

$2^{-n+4n\delta}\,4^{-4\delta n - \varepsilon n} \le \prod_i B(N_i^S, N_i^F) \le 2^{-n+4\delta n}.$

Observe that these inequalities hold regardless of the assignment $y$ of values to the response variables. Thus, taking conditional expectations (45) given the data $(X,Y)_{\Lambda(n)}$, we obtain

(56) $(1 - e^{-\gamma n})\,2^{-n-4n\delta-2\varepsilon n} \le Z^*_{\alpha n}(X_{\Lambda(n)}, Y_{\Lambda(n)}) \le 2^{-n+4\delta n} + e^{-\gamma n}.$

Since $\gamma$ can be made arbitrarily large by making $\alpha$ large, assertions (51) and (52) follow. □
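The two-sided bound on the product of Beta factors used in this proof comes from counting singleton cells. The short derivation below spells this out; it assumes the convention $B(k,l) = \int_0^1\theta^k(1-\theta)^l\,d\theta$ together with the bounds $4^{-k-l} \le B(k,l) \le 1$ quoted earlier in the paper, and it writes $s$ for the number of cells that contain exactly one data point (a symbol introduced only for this note).

```latex
% A cell holding a single data point contributes exactly 1/2, whatever its label:
\[
  B(1,0) = \int_0^1 \theta\,d\theta = \tfrac12 = \int_0^1 (1-\theta)\,d\theta = B(0,1).
\]
% Let s >= (1 - 4 delta) n be the number of singleton cells and suppose
% Lambda(n) <= n + epsilon n. The remaining Lambda(n) - s data points lie in
% cells whose factors are between 4^{-N_i} and 1, so
\[
  2^{-s}\,4^{-(\Lambda(n)-s)} \;\le\; \prod_i B(N_i^S,N_i^F) \;\le\; 2^{-s}
  \;\le\; 2^{-n+4\delta n},
\]
% and the left side equals 4^{-Lambda(n)} 2^{s}, which is at least
\[
  4^{-(n+\varepsilon n)}\,2^{(1-4\delta)n}
  = 2^{-n-4\delta n-2\varepsilon n}
  = 2^{-n+4\delta n}\,4^{-4\delta n-\varepsilon n}.
\]
```

None of these bounds depends on the labels, which is why the suprema and infima over $y$ in (51)-(54) cost nothing.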

Proof of Proposition 6. Since $H(f) < \log 2$ for every regression function $f \not\equiv 1/2$, Lemma 12 implies that, to prove (40), it suffices to replace the supremum over $m \ge \varepsilon n$ by the supremum over $m \in [\varepsilon n, \varepsilon^{-1}n]$. Now for $m$ in this range, the bound (43) is available; since $\log n$ is negligible compared to $n$, (43) implies that

$\sup_{\varepsilon n \le m \le \varepsilon^{-1}n}\log Z_m((X,Y)_n)$

may be replaced by

$\sup_{\varepsilon \le \alpha \le \varepsilon^{-1}}\log Z^*_{\alpha n}((X,Y)_{\Lambda(n)})$

in (40). Lemma 10 implies that this last supremum may be replaced by a maximum over a finite set of values $\alpha$, and now (40) follows from assertions (47), (48) and (50) of Proposition 11. □

4.4. Constant regression functions. The simplest route to the convergence (47) is via subadditivity (more precisely, approximate subadditivity) arguments. Assume that $f \equiv p$ is constant, and that the constant $p \ne 0, 1$. Recall (Section 2.4) that, in this case, the integral (15) defining the predictive probability almost factors perfectly into the product of two integrals, one over the data and split points in $[0, 1/2]$, the other over $(1/2, 1]$, of the same form [but on a different scale; see (18)]. Unfortunately, this factorization is not exact, as the partition of the unit interval induced by the split points $U_i$ includes an interval that straddles the demarcation point 1/2. However, the error can be controlled, and so the convergence (47) can be deduced from a subadditive WLLN (Proposition A.1 of Appendix A). The next lemma shows that the hypotheses of Proposition A.1 are met.

Lemma 14. Fix $\alpha > 0$, and write $\zeta_n = \log Z^*_{\alpha n}((X,Y)_{\Lambda(n)})$. For each pair $m, n \in \mathbb{N}$ of positive integers, there exist random variables $\zeta'_{m,m+n}$, $\zeta''_{n,m+n}$ and $R_{m,n}$ such that:

(57)




Proposition A.1 of Appendix A now implies the following.



Corollary 15. For some constant $\psi_p(\alpha)$,

(58) $n^{-1}\zeta_n \longrightarrow \psi_p(\alpha).$



Proof of Lemma 14. The construction requires auxiliary randomization, and so we assume that independent copies of the data sequence and the various Poisson processes are available. Consider the expectation (45) that defines the Poissonized predictive probability $\exp\{\zeta_{m+n}\}$. This expectation extends over all samples of split points of size $M(\alpha m + \alpha n)$. Since the integrand is positive, the expectation exceeds its restriction to the event $G$ that there are split points in both of the intervals $[b - (m+n)^{-1}, b]$ and $(b, b + (m+n)^{-1}]$, where $b := m/(m+n)$. Note that this event has probability

$\delta = \delta(\alpha) := (1 - e^{-\alpha})^2.$

Denote by $U', U''$ the split points nearest the demarcation point $b$ to its left and right, respectively. The product $\beta = \prod B(N_i^S, N_i^F)$ may be factored into three parts, consisting of terms indexed by intervals $J_i$ contained in $[0, U']$, intervals contained in $(U'', 1]$ and the single interval $J_* = (U', U'']$ that straddles the point $b$. Conditional on the values of $U', U''$, the three products are independent: by the scaling relation (18), the first two have the same distributions as the products $\beta$ occurring as integrands in the expectations defining

$\exp\{\zeta_{U'(m+n)}\}$ and $\exp\{\zeta_{(1-U'')(m+n)}\},$

respectively; and the third is just

$B(N_*^S, N_*^F) \ge 2^{-N_*}/(N_* + 1) \ge 4^{-N_*},$

where $N_*^S$ and $N_*^F$ are the numbers of successes and failures in the interval $J_*$, and $N_* = N_*^S + N_*^F$. Note that on $G$ the random variable $N_*$ is dominated by a Poisson random variable $N_{**}$ with mean 2, since the length of $J_*$ is less than $2/(m+n)$ (this requires auxiliary randomization). Now extend each of the first two products in the following manner: throw new, independent data samples and split points into the intervals $(U', b]$ and $(b, U'')$; remove the split points $U', U''$ and place a split at $b$; then recompute the partitions and replace the affected terms $B(N_i^S, N_i^F)$ by the new values. Note that this cannot increase the value of either product; moreover, the products remain conditionally independent given $U', U''$. Most importantly, the conditional expectations of these products (given the data and the values of $U', U''$) have the same distributions as $\exp\{\zeta_m\}$ and $\exp\{\zeta_n\}$, respectively. Thus,

$\exp\{\zeta_{m+n}\} \ge \delta\exp\{\zeta'_m\}\exp\{\zeta''_n\}\,4^{-N_{**}},$

where $\zeta'_m$ and $\zeta''_n$ are independent, with the same distributions as $\zeta_m$ and $\zeta_n$, respectively, and $N_{**}$ is Poisson with mean 2. □

Remark. There is a similar (and in some respects simpler) approximate subadditivity relation among the distributions of the random variables $\zeta_n$. For each pair $m, n \ge 1$ of positive integers, there exist independent random variables $\zeta'_{m,m+n}$, $\zeta''_{n,m+n}$ whose distributions are the same as those of $\zeta_m$, $\zeta_n$, respectively, such that

(59) $\zeta_{m+n} \le \zeta'_{m,m+n} + \zeta''_{n,m+n} + \log\Lambda(m+n).$

Corollary 15 can also be deduced from (59), but this requires a more sophisticated subadditive LLN than is proved in Appendix A, because the remainders $\log\Lambda(m+n)$ are not uniformly $L^1$ bounded, as they are in (57). This approach has the advantage that it leads to a proof that the convergence (58) holds almost surely.

Proof of (59). Consider the effect on the integral (15) of adding a split point at $b = m/(m+n)$: This breaks one of the intervals $J_i$ into two, leaving all of the others unchanged, and so the effect on the integrand in (15) is that one of the factors $B(N_i^S, N_i^F)$ is replaced by a product of two factors $B(N_L^S, N_L^F)B(N_R^S, N_R^F)$. By (23), the multiplicative error in this replacement is bounded above by $\Lambda(m+n)$. After the replacement, the factors in the integrand $\beta = \prod B(N_i^S, N_i^F)$ may be partitioned neatly into those indexed by intervals left of $b$ and those indexed by intervals right of $b$: thus,

$\beta = \beta'\beta'',$

where $\beta', \beta''$ are independent and have the same distributions as the products $\beta$ occurring as integrands in the expectations defining $\exp\{\zeta_m\}$ and $\exp\{\zeta_n\}$, respectively. Thus,

$\exp\{\zeta_{m+n}\} \le \exp\{\zeta'_m\}\exp\{\zeta''_n\}\,\Lambda(m+n),$

where $\zeta'_m = \zeta'_{m,m+n}$ and $\zeta''_n = \zeta''_{n,m+n}$ are independent and distributed as $\zeta_m$ and $\zeta_n$, respectively. □
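Inequality (23) is not reproduced in this section, so the following is only a brute-force check of a bound of the same flavor, under the assumed convention $B(k,l) = k!\,l!/(k+l+1)!$. It enumerates every way of splitting a cell with $N$ points into a left and a right part and records the largest value of $B(k,l)/\bigl(B(k_1,l_1)B(k_2,l_2)\bigr)$; in this enumeration the ratio never exceeds $\max(1, N)$, and hence never exceeds the total sample size when the sample is nonempty. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def beta(k, l):
    """B(k, l) = k! l! / (k + l + 1)! as an exact fraction (assumed convention)."""
    return Fraction(1, (k + l + 1) * comb(k + l, k))


def worst_split_ratio(total):
    """Largest B(k, l) / (B(k1, l1) * B(k2, l2)) over all splits of a cell with `total` points."""
    worst = Fraction(0)
    for k in range(total + 1):
        l = total - k
        for k1 in range(k + 1):
            for l1 in range(l + 1):
                ratio = beta(k, l) / (beta(k1, l1) * beta(k - k1, l - l1))
                worst = max(worst, ratio)
    return worst


for total in range(0, 13):
    print(total, float(worst_split_ratio(total)))
```

Exact rational arithmetic is used so that the comparison with $\max(1, N)$ is not affected by rounding.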
4.5. Piecewise constant regression functions. The next step is to extend the convergence (47) to piecewise constant regression functions $f$. For ease of exposition, we shall restrict attention to step functions with a single discontinuity in $(0,1)$; the general case involves no new ideas. Thus, assume that

$f(x) = p_L$ for $x \le b$,
$f(x) = p_R$ for $x > b$,

and

$p_L \ne p_R$.

Fix $\alpha > 0$, and set

(60) $Z^*_n := Z^*_{\alpha n}((X,Y)_{\Lambda(n)}).$

Lemma 16. With $P_f$-probability approaching one as $n \to \infty$,

(61) $Z^*_n \ge Z'_n Z''_n/n^2$

and

(62) $Z^*_n \le 2nZ'_nZ''_n,$

where, for each $n$, the random variables $Z'_n, Z''_n$ are independent, with the same distributions as

(63) $Z'_n \stackrel{\mathcal{L}}{=} Z^*_{\alpha nb}((X,Y)_{\Lambda(bn)})$ under $P_{p_L}$; $Z''_n \stackrel{\mathcal{L}}{=} Z^*_{\alpha n - \alpha nb}((X,Y)_{\Lambda(n - nb)})$ under $P_{p_R}$.

Proof. Consider the effect on $Z^*_n$ of placing an additional split point at $b$: this would divide the interval straddling $b$ into two nonoverlapping intervals $L, R$ (for "left" and "right"), and so in the integrand $\beta := \prod B(N_i^S, N_i^F)$ the single factor $B(N_*^S, N_*^F)$ representing the interval straddling $b$ would be replaced by a product of two factors $B(N_L^S, N_L^F)$ and $B(N_R^S, N_R^F)$. As in the proof of the subadditivity inequality (59) in Section 4.4, the factors of this modified product separate into those indexed by subintervals of $[0, b]$ and those indexed by subintervals of $[b, 1]$; thus, the modified product has the form $\beta'\beta''$, where $\beta'$ and $\beta''$ are the products of the factors indexed by intervals to the left and right, respectively, of $b$. Denote by $Z'_n$ and $Z''_n$ the conditional expectations of $\beta'$ and $\beta''$ (given the data). These are independent random variables, and by the scaling relation (18), their distributions satisfy (63). By inequality (23), the multiplicative error in making the replacement is at most $\Lambda(n)$; since the event $\Lambda(n) \ge 2n$ has probability tending to 0 as $n \to \infty$, inequality (62) follows.
The reverse inequality (61) follows by a related argument. Let $G$ be the event that the data sample $(X,Y)_{\Lambda(n)}$ contains no points with covariate $X_i \in [b, b + n^{-2}]$. Since the covariates are generated by a Poisson point process with intensity $n$, the probability of $G^c$ is approximately $n^{-1}$. Consider the integral (over all samples of split points) that defines $Z^*_n$: this integral exceeds its restriction to the event $A$ that there is a split point in $[b, b + n^{-2}]$. The conditional probability of $A$ (given the data) is approximately $\alpha n^{-1}$, and thus larger than $n^{-2}$ for large $n$. On the event $G \cap A$,

$\beta = \beta'\beta''$

holds exactly, as the split point in $[b, b + n^{-2}]$ produces exactly the same bins as if the split point were placed at $b$. Moreover, conditioning on the event $A$ does not affect the joint distribution (conditional on the data) of $\beta', \beta''$ when $G$ holds. Thus, the conditional expectation of the product, given $A$ and the data, equals $Z'_nZ''_n$ on the event $G$. □

Taking nth roots on each side of (61) and appealing to Corollary 15 now 
yields the following. 

Corollary 17. If the regression function is piecewise constant, with only finitely many discontinuities, then the convergence (47) holds.

4.6. Thinning. Extension of the preceding corollary to arbitrary Borel measurable regression functions will be based on thinning arguments. Recall that if points of a Poisson point process of intensity $\lambda(x)$ are randomly removed with location-dependent probability $\varrho(x)$, then the resulting "thinned" point process is again Poisson, with intensity $\lambda(x) - \varrho(x)\lambda(x)$. This principle may be applied to both the success ($y = 1$) and failure ($y = 0$) point processes in a Poissonized data sample. Because thinning at location-dependent rates may change the distribution of the covariates, it will be necessary to deal with data sequences with nonuniform covariate distribution. Thus, let $(X,Y)$ be a data sample of random size with Poisson-$\lambda$ distribution under the measure $P_{f,F}$ (here $f$ is the regression function, $F$ is the distribution of the covariate sequence $X_j$). If successes $(x, y = 1)$ are removed from the sample with probability $\varrho_1(x)$ and failures $(x, y = 0)$ are removed with probability $\varrho_0(x)$, then the resulting sample will be a data sample of random size with the Poisson-$\mu$ distribution from $P_{g,G}$, where the mean $\mu$, the regression function $g$ and the covariate distribution $G$ satisfy

(64) $\mu g(x)G(dx) = (1 - \varrho_1(x))\lambda f(x)F(dx)$ and

$\mu(1 - g(x))G(dx) = (1 - \varrho_0(x))\lambda(1 - f(x))F(dx).$
By the monotonicity principle (Lemma 7), the predictive probability of the thinned sample will be no smaller than that of the original sample. Thus, thinning allows comparison of predictive probabilities for data generated by two different measures $P_{f,F}$ and $P_{g,G}$. The first and easiest consequence is the continuity of the rate function.

Lemma 18. The rate function $\psi_p(\alpha)$ is jointly continuous in $p, \alpha$.

Proof. Corollary 15 and Lemma 10 imply that the functions $\alpha \mapsto \psi_p(\alpha)$ are uniformly continuous in $\alpha$. Continuity in $p$ and joint continuity in $p, \alpha$ are now obtained by thinning. Let $(X,Y)$ be a random sample of size $\Lambda(n) \sim$ Poisson-$n$ from a data stream distributed according to $P_p$ (i.e., $f \equiv p$ and $F$ is the uniform-[0,1] distribution). Let $(X,Y)'$ be the sample obtained by randomly removing failures from the sample $(X,Y)$, with probability $\varepsilon$. Then $(X,Y)'$ has the same distribution as a random sample of size $\Lambda(n - \varepsilon qn)$ (here $q = 1 - p$) from a data stream distributed according to $P_{p'}$, where $p' = p/(1 - \varepsilon q)$. By the monotonicity principle (Lemma 7),

$Z^*_{\alpha n}((X,Y)) \le Z^*_{\alpha n}((X,Y)').$

Taking $n$th roots and appealing to Corollary 15 shows that

$\psi(p,\alpha) \le (1 - \varepsilon q)^{-1}\,\psi(p/(1 - \varepsilon q), \alpha/(1 - \varepsilon q)).$

A similar inequality in the opposite direction can be obtained by reversing the roles of $p$ and $p/(1 - \varepsilon q)$. The continuity in $p$ of $\psi(p,\alpha)$ now follows from the continuity in $\alpha$, and the joint continuity follows from the uniform continuity in $\alpha$. □
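The effect of thinning failures on the sample size and on the success probability can be seen in a small simulation. The sketch below is an illustration with hypothetical parameter values and function names; for simplicity it uses a fixed sample size rather than a Poisson one, and it ignores the covariates, whose uniform distribution is unaffected when the removal probability does not depend on location.

```python
import random


def thin_failures(ys, eps, rng):
    """Remove each failure (y = 0) independently with probability eps; keep every success."""
    return [y for y in ys if y == 1 or rng.random() >= eps]


rng = random.Random(4)
n, p, eps = 200000, 0.3, 0.2
q = 1 - p
ys = [1 if rng.random() < p else 0 for _ in range(n)]
kept = thin_failures(ys, eps, rng)
retained_fraction = len(kept) / n
success_rate = sum(kept) / len(kept)
print(retained_fraction, 1 - eps * q)         # compare with 1 - eps*q
print(success_rate, p / (1 - eps * q))        # compare with p' = p / (1 - eps*q)
```

The retained fraction is close to $1 - \varepsilon q$ and the success rate of the thinned sample is close to $p' = p/(1 - \varepsilon q)$, matching the change of parameters used in the proof.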

Proposition 19. The convergence (47) holds for every Borel measur- 
able regression function f . 



Proof. By Corollary 17 above, the convergence holds for all piecewise 
constant regression functions with only finitely many discontinuities. The 
general case will be deduced from this by another thinning argument. 

If $f : [0,1] \to [0,1]$ is measurable, then for each $\varepsilon > 0$, there exists a piecewise constant $g : [0,1] \to [0,1]$ (with only finitely many discontinuities) such that $\|f - g\|_1 < \varepsilon$. If $\varepsilon$ is small, then $|f - g|$ must be small except on a set $B$ of small Lebesgue measure; moreover, $g$ may be chosen so that $g = 1$ wherever $f$ is near 1, and $g = 0$ wherever $f$ is near 0 (except on $B$). For such choices of $\varepsilon$ and $g$, there will exist removal rate functions $\varrho_0(x)$ and $\varrho_1(x)$ so that equation (64) holds with $F$ = the uniform distribution on $[0,1]$, $G$ = the uniform distribution on $[0,1] - B$, and

$\left|\dfrac{\lambda}{\mu} - 1\right| < \delta(\varepsilon)$

for some constants $\delta(\varepsilon) \to 0$ as $\varepsilon \to 0$. (Note: Requiring $G$ to be the uniform distribution on $[0,1] - B$ forces complete thinning in $B$, i.e., $\varrho_0 = \varrho_1 = 1$ in $B$.) Thus, a Poissonized data sample distributed according to $P_f$ may be thinned so as to yield a Poissonized data sample distributed according to $P_{g,G}$ in such a way that the overall thinning rate is arbitrarily small. It follows, by the monotonicity principle, that the Poissonized predictive probabilities for data distributed according to $P_f$ are majorized by those for data distributed according to $P_{g,G}$, with a slightly smaller rate.

Now consider data $(X,Y)$ distributed according to $P_{g,G}$. Since $g$ is piecewise
constant and $G$ is a uniform distribution, the transformed data $(GX, Y)$
will be distributed as $P_h$, where $h$ is again piecewise constant. Moreover,
since the removed set $B$ has small Lebesgue measure, the function $h$ is
close to the function $g$ in the Skorohod topology, and so by Lemma 18,
$\psi_h \approx \psi_g \approx \psi_f$. Because the convergence (47) has been established for piecewise
constant regression functions $h$, it now follows from the monotonicity
principle that

$$P_f\bigl\{n^{-1}\log Z^{*}_{n}((X,Y)^{\Lambda(n)}) > \psi_f(\alpha) + \delta\bigr\} \longrightarrow 0$$

for every $\delta > 0$. This proves the upper (and for us, the more important) half
of (47). The lower half may be proved by a similar thinning argument in the
reverse direction. □

5. Proof of the entropy inequality (50). This requires a change of perspective.
Up to now, we have taken the point of view that the covariates $X_j$
and the split points $U_i$ are generated by Poisson point processes in the unit
interval of intensities $n$ and $\alpha n$, respectively. However, the transformation
formula (18) implies that the predictive probabilities, and hence also their
Poissonized versions, are unchanged if the covariates and the split points are
rescaled by a common factor $n$. The rescaled covariates $\hat X_j := nX_j$ and split
points $\hat U_i := nU_i$ are then generated by Poisson point processes of intensities
1 and $\alpha$ on the interval $[0,n]$. Consequently, versions of all the random
variables $Z^{*}_{n}((X,Y)^{\Lambda(n)})$ may be constructed from two independent Poisson
processes of intensities 1 and $\alpha$ on the whole real line. The advantage of
this new point of view is the possibility of deducing the large-$n$ asymptotics
from the Ergodic theorem.

5.1. Reformulation of the inequality. To avoid cluttered notation, we 
shall henceforth drop the hats from the rescaled covariates and split points. 
Thus, assume that under both $P = P_p$ and $Q$,

$$\cdots < X_{-1} < X_0 < 0 < X_1 < \cdots$$

and

$$\cdots < U_{-1} < U_0 < 0 < U_1 < \cdots$$

are the points of independent Poisson point processes $X$ and $U$ of intensities
1 and $\alpha$, respectively, and let $\{W_i\}_{i\in\mathbb{Z}}$ be a stream of uniform-[0, 1] random
variables independent of the point processes $X$, $U$. Denote by $N(t)$ the number
of occurrences in the Poisson point process $X$ during the interval $[0,t]$,
and set $J_i = (U_i, U_{i+1}]$. Let $\{Y_j\}_{j\in\mathbb{Z}}$ be Bernoulli r.v.s distributed according
to the following laws:

(A) Under $P$, the random variables $Y_j$ are i.i.d. Bernoulli-$p$, jointly independent
of the Poisson point processes $U$, $X$.

(B) Under $Q$, the random variables $Y_j$ are conditionally independent,
given $X, U, W$, with conditional distributions $Y_j \sim$ Bernoulli-$W_i$, where $i$ is
the index of the interval $J_i$ containing $X_j$.
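The construction in (B) can be sketched in a few lines of simulation code. This is only an illustration under stated assumptions: the function name `sample_under_Q` and its parameters are hypothetical, the processes are restricted to a finite window $[0,t]$, and the first interval is truncated at 0 (its $W$ is still uniform).

```python
import bisect
import random


def sample_under_Q(t, alpha, seed=None):
    """Draw (X, U, Y) on [0, t] following law (B): intensities 1 and alpha."""
    rng = random.Random(seed)

    def poisson_points(rate):
        # Homogeneous Poisson process on [0, t] built from exponential gaps.
        points, s = [], rng.expovariate(rate)
        while s <= t:
            points.append(s)
            s += rng.expovariate(rate)
        return points

    xs = poisson_points(1.0)    # covariates X_j
    us = poisson_points(alpha)  # split points U_i
    # One uniform success probability W_i per interval between split points.
    ws = [rng.random() for _ in range(len(us) + 1)]
    ys = []
    for x in xs:
        # Number of split points strictly below x = index of the
        # half-open interval (U_i, U_{i+1}] that contains x.
        i = bisect.bisect_left(us, x)
        ys.append(1 if rng.random() < ws[i] else 0)
    return xs, us, ys


xs, us, ys = sample_under_Q(t=50.0, alpha=2.0, seed=1)
```

Responses attached to covariates in the same interval share one $W_i$, which is what makes the sequence dependent under $Q$.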

Under $Q$, the sequence $\{Y_n\}_{n\in\mathbb{Z}}$ is an ergodic, stationary sequence; for reasons
that we shall explain below, we shall refer to this process as the rechargeable
Polya urn. The distribution of $(X,Y) \cap [0,t]$ under $Q$ is, after rescaling
of the covariates by the factor $t$, the same as that of a data sample of random
size $\Lambda(t)$ under the Poisson mixture $Q^{*}_{\alpha t}$ defined in Section 4.1 above.
For (extended) integers $-\infty \le m < n \le \infty$, define $\sigma$-algebras

$$\mathcal{F}^{X,Y}_{m,n} = \sigma(\{X_j, Y_j\}_{m\le j\le n}) \quad\text{and}\quad \mathcal{F}^{Y}_{m,n} = \sigma(\{Y_j\}_{m\le j\le n}).$$

If $m, n$ are both finite, then the restrictions of the measures $P$, $Q$ to $\mathcal{F}^{X,Y}_{m,n}$
(and therefore also to $\mathcal{F}^{Y}_{m,n}$) are mutually absolutely continuous. The Radon-Nikodym
derivative on the smaller $\sigma$-algebra $\mathcal{F}^{Y}_{1,n}$ is just

$$\left(\frac{dQ}{dP}\right)_{\mathcal{F}^{Y}_{1,n}} = \frac{q(Y_1,Y_2,\dots,Y_n)}{p(Y_1,Y_2,\dots,Y_n)}, \tag{65}$$

where

$$q(y_1,y_2,\dots,y_n) := Q\{Y_j = y_j\ \ \forall\, 1\le j\le n\}$$

and

$$p(y_1,y_2,\dots,y_n) := P\{Y_j = y_j\ \ \forall\, 1\le j\le n\} = p^{\sum_{i=1}^{n} y_i}(1-p)^{\,n-\sum_{i=1}^{n} y_i}.$$

The Radon-Nikodym derivative on the larger $\sigma$-algebra $\mathcal{F}^{X,Y}_{1,n}$ cannot be so
simply expressed, but is closely related to the Poissonized predictive probability
$Z^{*}_{n}((X,Y)^{n})$ defined by (42). Define

$$Z_n := p(Y_1,Y_2,\dots,Y_n)\left(\frac{dQ}{dP}\right)_{\mathcal{F}^{X,Y}_{1,n}}; \tag{66}$$

then by (16) the random variable $Z_{N(n)}$ has the same distribution under
$P$ as does the Poissonized predictive probability (42) under $P_f$, for any $f$.
Hence, the convergence (47) must also hold for the random variables $Z_n$:
Corollary 20. Under $P$, as $n \to \infty$,

$$n^{-1}\log Z_n \xrightarrow{\,L^1\,} \psi_p(\alpha). \tag{67}$$

Therefore, to prove the entropy inequality (50), it suffices to prove that

$$\lim_{n\to\infty} n^{-1}E_P\log\left(\frac{dQ}{dP}\right)_{\mathcal{F}^{Y}_{1,n}} < 0. \tag{68}$$

Proof. The first assertion follows directly from (47) of Proposition 11.
Thus, to prove the entropy inequality $\psi_p(\alpha) < -H(p)$, it suffices, in view
of (66), to prove that (68) holds when the $\sigma$-algebra $\mathcal{F}^{Y}_{1,n}$ is replaced by
$\mathcal{F}^{X,Y}_{1,n}$. But the former is a sub-$\sigma$-algebra of the latter; since log is a concave
function, Jensen's inequality implies that

$$E_P\log\left(\frac{dQ}{dP}\right)_{\mathcal{F}^{X,Y}_{1,n}} \le E_P\log\left(\frac{dQ}{dP}\right)_{\mathcal{F}^{Y}_{1,n}}.$$

5.2. Digression: The relative SMB theorem. The existence of the limit
(67) is closely related to the relative Shannon-McMillan-Breiman theorem
studied by several authors [14, 15, 18, 19]. The sequence $Y_1, Y_2, \dots$ is, under
either measure $P$ or $Q$, an ergodic stationary sequence of Bernoulli random
variables. Thus, by the usual Shannon-McMillan-Breiman theorem [27], as
$n \to \infty$,

$$n^{-1}\log q(Y_1,Y_2,\dots,Y_n) \xrightarrow{\ \text{a.s. } Q\ } -h_Q$$

and

$$n^{-1}\log p(Y_1,Y_2,\dots,Y_n) \xrightarrow{\ \text{a.s. } P\ } -H(p),$$

where $h_Q$ is the Kolmogorov-Sinai entropy of the sequence $Y_j$ under $Q$. In
general, of course, the almost sure convergence holds only for the probability
measure indicated (see, for instance, [14] for an example where the first
convergence fails under the alternative measure $P$). The relative
Shannon-McMillan-Breiman theorem of [19] gives conditions under which the
difference of the two averages



$$n^{-1}\log\frac{q(Y_1,Y_2,\dots,Y_n)}{p(Y_1,Y_2,\dots,Y_n)} = n^{-1}\log\left(\frac{dQ}{dP}\right)_{\mathcal{F}^{Y}_{1,n}}$$

converges under $P$. In the case at hand, unfortunately, these conditions are
not of much use: they essentially require the user to verify that

$$n^{-1}\log q(Y_1,Y_2,\dots,Y_n) \xrightarrow{\ \text{a.s.}\ } C$$

for some constant $C$. Thus, it appears that [19] does not provide a shortcut
to the convergence (47).
5.3. The rechargeable Polya urn. In the ordinary Polya urn scheme, balls
are drawn at random from an urn, one at a time; after each draw, the ball
drawn is returned to the urn along with another of the same color. If initially
the urn contains one red and one blue ball, then the limiting fraction $\Theta$
of red balls is uniformly distributed on the unit interval. The Polya urn
is connected with Bayesian statistics in the following way: the conditional
distribution of the sequence of draws given the value of $\Theta$ is that of i.i.d.
Bernoulli-$\Theta$ random variables.

The rechargeable Polya urn is a simple variant of the scheme described
above, differing only in that, before each draw, with probability $r > 0$, the
urn is emptied and then reseeded with one red and one blue ball. Unlike
the usual Polya urn, the rechargeable Polya urn is recurrent, that is, if
$V_n := (R_n, B_n)$ denotes the composition of the urn after $n$ draws, then $V_n$ is
a positive recurrent Markov chain on the state space $\mathbb{N} \times \mathbb{N}$. Consequently,
$\{V_n\}$ may be extended to $n \in \mathbb{Z}$ in such a way that the resulting process is
stationary. Let $Y_n$ denote the binary sequence recording the results of the
successive draws (1 = blue, 0 = red). Clearly, this sequence has the same
law as does the sequence $Y_1, Y_2, \dots$ under the probability measure $Q$ [with
$r = \alpha/(1 + \alpha)$].
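The urn dynamics just described are easy to simulate. The following is a minimal sketch, not part of the original argument; the function name `rechargeable_polya_urn` and the choice $\alpha = 2$ are illustrative assumptions, and the simulation starts from a freshly seeded urn rather than from the stationary composition.

```python
import random


def rechargeable_polya_urn(n_draws, r, seed=None):
    """Simulate draws (1 = blue, 0 = red) with recharge probability r."""
    rng = random.Random(seed)
    red, blue = 1, 1
    draws = []
    for _ in range(n_draws):
        # Before each draw, empty and reseed the urn with probability r.
        if rng.random() < r:
            red, blue = 1, 1
        # Draw a ball at random and return it with one more of its color.
        if rng.random() < blue / (red + blue):
            blue += 1
            draws.append(1)
        else:
            red += 1
            draws.append(0)
    return draws


alpha = 2.0  # illustrative split-point intensity
ys = rechargeable_polya_urn(10_000, r=alpha / (1 + alpha), seed=0)
frac_blue = sum(ys) / len(ys)
```

By the red/blue symmetry of the scheme, the long-run fraction of blue draws should be close to 1/2, although the draws are not independent.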

Lemma 21. For any $\varepsilon > 0$, there exists $m$ such that the following is
true: For any finite sequence $y_{-k}, y_{-k+1}, \dots, y_0$, the conditional distribution
of $Y_{m+1}, Y_{m+2}, \dots, Y_{2m}$ given that $Y_i = y_i$ for all $-k \le i \le 0$ differs from the
$Q$-unconditional distribution by less than $\varepsilon$ in total variation norm.

Proof. It is enough to show that the conditional distribution of $Y_{m+1},
\dots, Y_{2m}$ given the composition $V_0$ of the urn before the first draw differs
from the unconditional distribution by less than $\varepsilon$. Let $T$ be the time of the
first regeneration (emptying of the urn) after time 0; then conditional on
$T = n$, for any $n < m$, and on $V_0$, the distribution of $Y_{m+1}, \dots, Y_{2m}$ does not
depend on the value of $V_0$. Thus, if $m$ is sufficiently large that the probability
of having at least one regeneration event between the first and $m$th draws
exceeds $1 - \varepsilon$, then the conditional distribution given $V_0$ differs from the
unconditional distribution by less than $\varepsilon$ in total variation norm. □

The construction of the sequence $Y = Y_1, Y_2, \dots$ using the rechargeable
Polya urn shows that this sequence behaves as a "factor" of a denumerable-state
Markov chain (in terminology more familiar to statisticians, the sequence
$Y$ follows a "hidden Markov model"). Note that the original specification
of the measure $Q$, in Section 5.1 above, exhibits $Y$ as a factor of the
Harris-recurrent Markov chain obtained by adjoining to the state variable
the current value of $W$. It does not appear that $Y_n$ can be represented as
a function of a finite-state Markov chain; if it could, then results of Kaijser
[12] would imply the existence of the limit

$$\lim_{n\to\infty} n^{-1}\log q(Y_1,Y_2,\dots,Y_n)$$

almost surely under $P$, and exhibit it as the top Lyapunov exponent of a sequence
of random matrix products. Unfortunately, little is known about the
asymptotic behavior of random operator products (see [13] and references
therein for the state of the art), and so it does not appear that (65) can be
obtained by an infinite-state extension of Kaijser's result.
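The hidden Markov structure means that sequence probabilities for the urn can be computed by a forward recursion over urn compositions. The sketch below is an illustration only: `urn_sequence_prob` is a hypothetical helper, and it starts from a freshly seeded urn, so it gives $q(y_1,\dots,y_n)$ conditional on a regeneration at time 0 rather than under the stationary law.

```python
from collections import defaultdict
from itertools import product


def urn_sequence_prob(ys, r):
    """Probability of draws ys (1 = blue, 0 = red) from a fresh rechargeable urn."""
    # weights[(red, blue)] = P(draws so far, current urn composition)
    weights = {(1, 1): 1.0}
    for y in ys:
        new_weights = defaultdict(float)
        for (red, blue), w in weights.items():
            # Either recharge to (1, 1) with probability r, or keep the urn.
            for (rd, bl), branch in (((1, 1), r), ((red, blue), 1.0 - r)):
                p_blue = bl / (rd + bl)
                if y == 1:
                    new_weights[(rd, bl + 1)] += w * branch * p_blue
                else:
                    new_weights[(rd + 1, bl)] += w * branch * (1.0 - p_blue)
        weights = new_weights
    return sum(weights.values())


# Sanity check: probabilities of all length-5 sequences add up to one.
total = sum(urn_sequence_prob(seq, 0.4) for seq in product((0, 1), repeat=5))
```

The number of reachable compositions grows only quadratically in the sequence length, but the state space is unbounded, which is consistent with the remark that no finite-state representation is apparent.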

5.4. Proof of (68). Since it is not necessary to establish the convergence
of the integrands on the left-hand side of (68), we shall not attempt to do
so. Instead, we will proceed from the identity

$$n^{-1}E_P\log\left(\frac{dQ}{dP}\right)_{\mathcal{F}^{Y}_{1,n}} = n^{-1}E_P\log\frac{q(Y_1,Y_2,\dots,Y_n)}{p(Y_1,Y_2,\dots,Y_n)} = n^{-1}\sum_{k=0}^{n-1}E_P\log\frac{q(Y_{k+1}\mid Y_1,Y_2,\dots,Y_k)}{p(Y_{k+1}\mid Y_1,Y_2,\dots,Y_k)}. \tag{69}$$

Because the random variables $Y_i$ are i.i.d. Bernoulli-$p$ under $P$, the conditional
probabilities $p(y_{k+1}\mid y_1, y_2, \dots, y_k)$ must coincide with the unconditional
probabilities $p(y_{k+1})$. Thus, the usual information inequality (Jensen's
inequality), in the form $E_f\log(g(X)/f(X)) < 0$ for distinct probability densities
$f$, $g$, implies that, for each $k$,

$$E_P\log\frac{q(Y_{k+1}\mid Y_1,Y_2,\dots,Y_k)}{p(Y_{k+1})} \le 0, \tag{70}$$

with the inequality strict unless the $Q$-conditional distribution of $Y_{k+1}$ given
the past coincides with the Bernoulli-$p$ distribution. Moreover, the left-hand
side of (70) will remain bounded away from 0 as long as the conditional
distribution remains bounded away from the Bernoulli-$p$ distribution (in
any reasonable metric, e.g., the total variation distance). Thus, to complete
the proof of (68), it suffices to establish the following lemma.
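For Bernoulli distributions the information inequality can be checked numerically. The sketch below is illustrative: `expected_log_ratio` is a hypothetical helper, and the comparison with $2(p-q)^2$ uses Pinsker's inequality (a standard fact, not taken from the text) as one way to see that the expectation stays bounded away from 0 when the total variation distance $|p-q|$ does.

```python
import math


def expected_log_ratio(p, q):
    """E log(g(Y)/f(Y)) for Y ~ Bernoulli(p), f = Bernoulli(p), g = Bernoulli(q).

    Requires 0 < p < 1 and 0 < q < 1.
    """
    return p * math.log(q / p) + (1 - p) * math.log((1 - q) / (1 - p))


p = 0.3
grid = [i / 20 for i in range(1, 20)]  # q = 0.05, 0.10, ..., 0.95
values = {q: expected_log_ratio(p, q) for q in grid}

# Pinsker's inequality for Bernoulli laws: -E log(g/f) >= 2 * (p - q)^2.
pinsker_ok = all(-values[q] >= 2 * (p - q) ** 2 - 1e-12 for q in grid)
```

Every entry of `values` is nonpositive, and it vanishes only at `q == p`, in line with (70).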

Lemma 22. There is no sequence of integers $k_n \to \infty$ along which

$$\bigl\|q(\,\cdot \mid Y_1,Y_2,\dots,Y_{k_n}) - p(\,\cdot\,)\bigr\|_{TV} \longrightarrow 0 \tag{71}$$

in $P$-probability.



Proof. This is based on the fact that the sequence of draws $Y_1, Y_2, \dots$
produced by the rechargeable Polya urn is not a Bernoulli sequence, that is,
the $Q$- and $P$-distributions of the sequence $Y_1, Y_2, \dots$ are distinct. Denote by
$q_k$ the $Q$-conditional probability that $Y_{k+1} = 1$ given the values $Y_1, Y_2, \dots, Y_k$.
Suppose that $q_{k_n} \to p$ in $P$-probability; then by summing over successive
values of the last $l$ variables, it follows that $q_{k_n - l} \to p$ in $P$-probability for
each fixed $l \in \mathbb{N}$. We will show that this leads to a contradiction.

Consider the following method of generating binary random variables
$Y_1, Y_2, \dots, Y_{2m}$: first generate i.i.d. Bernoulli-$p$ random variables $Y_j$ for
$-k \le j \le 0$; then, conditional on their values, generate $Y_1$ according to $q_{k+1}$; then,
conditional on $Y_1$, generate $Y_2$ according to $q_{k+2}$; and so on. By the hypothesis
of the preceding paragraph, there is a sequence $k_n \to \infty$ such that,
for any fixed $m$, the joint distribution of $Y_1, Y_2, \dots, Y_{2m}$ converges to the
product-Bernoulli-$p$ distribution. But this contradicts the mixing property
of the rechargeable Polya urn asserted by Lemma 21 above. □

APPENDIX A: AN ALMOST SUBADDITIVE WLLN 

The purpose of this appendix is to prove the simple variant of the sub- 
additive ergodic theorem required in Section 4. For the original subadditive 
ergodic theorem of Kingman, see [16], and for another variant that is useful 
in applications to percolation theory, see [17]. There are two novelties in our 
version: (a) the subadditivity relation is only approximate, with a random 
error; and (b) there is no measure-preserving transformation related to the
sequence $S_n$.

Proposition A.1. Let $S_n$ be real random variables. Suppose that, for
each pair $m, n \ge 1$ of positive integers, there exist random variables $S'_{m,m+n}$,
$S''_{n,m+n}$ and a nonnegative random variable $R_{m,n}$ such that:

(a) $S'_{m,m+n}$ and $S''_{n,m+n}$ are independent;

(b) $S'_{m,m+n}$ has the same distribution as $S_m$;

(c) $S''_{n,m+n}$ has the same distribution as $S_n$;

(d) the random variables $\{R_{m,n}\}_{m,n\ge 1}$ are identically distributed;

(e) $ER_{1,1} < \infty$ and $\{S_n/n\}_{n\ge 1}$ are uniformly integrable; and

(f) for all $m, n \ge 1$,

$$S_{m+n} \le S'_{m,m+n} + S''_{n,m+n} + R_{m,n}. \tag{A.1}$$

Then

$$\frac{S_n}{n} \xrightarrow{\,L^1\,} \gamma := \liminf_{n\to\infty}\frac{ES_n}{n}. \tag{A.2}$$

Note. The random variables $\{S_n/n\}$ considered in Corollary 15 are
uniformly bounded, and so the uniform integrability hypothesis (e) holds
trivially.

Proof of Proposition A.1. Since the random variables $S'_{m,m+n}$ and
$S''_{n,m+n}$ are independent, with the same distributions as $S_m$ and $S_n$,
respectively, Caratheodory's theorem on extension of measures implies that
the probability space may be enlarged so as to support additional random
variables permitting recursion on the inequality (A.1). Here the simplest
recursive strategy works: from a starting value $n = km + r$, reduce by $m$ at
each step. This leads to an inequality of the form

$$S_{km+r} \le S^{0}_{r} + \sum_{j=1}^{k} S^{j}_{m} + \sum_{j=1}^{k} R_j, \tag{A.3}$$

where the random variables $\{S^{j}_{m}\}_{j\ge 1}$ are i.i.d., each with the same distribution
as $S_m$, and the random variables $R_j$ are identically distributed (but
not necessarily independent), each with the law of $R_{1,1} := R$.

The weak law (A.2) is easily deduced from the inequality (A.3). Note
first that the special case of (A.1) with $m = 1$, together with hypothesis
(d), implies that $ES_n \le nER + nES_1 < \infty$ for every $n \ge 1$, and so $\gamma < \infty$.
Assume for definiteness that $\gamma > -\infty$; the case $\gamma = -\infty$ may be treated by
a similar argument. Divide each side of (A.3) by $km$; as $k \to \infty$,

$$\frac{S^{0}_{r}}{km} \xrightarrow{\,P\,} 0 \quad\text{and}\quad \frac{1}{km}\sum_{j=1}^{k} S^{j}_{m} \xrightarrow{\,P\,} \frac{ES_m}{m},$$

the latter by the usual WLLN. The WLLN need not apply to the sum $\sum_j R_j$,
since the terms are not necessarily independent; however, since all of the
terms are nonnegative and have the same expectation $ER < \infty$, Markov's
inequality implies that, for any $\varepsilon > 0$,

$$P\Bigl\{\frac{1}{km}\sum_{j=1}^{k} R_j \ge \varepsilon\Bigr\} \le \frac{ER}{m\varepsilon}.$$

Thus, letting $m \to \infty$ through a subsequence along which $ES_m/m \to \gamma$, we
find that, for any $\varepsilon > 0$,

$$\lim_{n\to\infty} P\{S_n \ge n\gamma + n\varepsilon\} = 0.$$

Since the r.v.s $S_n/n$ are uniformly integrable, this implies that

$$\lim_{n\to\infty} P\{S_n \le n\gamma - n\varepsilon\} = 0,$$

because otherwise $\liminf ES_n/n < \gamma$. This proves that $S_n/n \to \gamma$ in probability;
in view of the uniform integrability of the sequence $S_n/n$, convergence
in $L^1$ follows. □

Remark. Numerous variants of this proposition are true, and may be
established by more careful recursions. Among these are SLLNs for random
variables satisfying hypotheses such as those given in (59) above, where the
remainders $\log\Lambda(m + n)$ are not identically distributed, but whose growth
is sublinear in $m + n$. For hints as to how such results may be approached,
see [11].
APPENDIX B: PROOF OF PROPOSITION 2

Recall that an interval $J$ is $\varepsilon$-bad if any one of the inequalities (30), (31)
or (32) holds. For each of these inequalities, there are two possibilities: the
relevant count $N(J)$, $N^S(J)$, $N^F(J)$ may be unusually large or unusually
small. Thus, there are six distinct ways that $J$ may be $\varepsilon$-bad, and hence six
ways that a point $x$ may be $(\varepsilon,\kappa)$-bad. To prove (33), we will partition the
set $B = B_n(\varepsilon,\kappa)$ into six subsets, one for each possibility, and show that (33)
holds for each of the six subsets. In fact, since the six inequalities (30)-(32)
are coupled [in particular, if $N^S(J)$ is unusually small, then either $N(J)$ is
also unusually small or $N^F(J)$ is unusually large], it suffices to consider only
four possibilities: those where $N(J)$ is either unusually large or small, and
those where $N^S(J)$ or $N^F(J)$ is unusually large. These may all be handled
in a similar fashion, so we shall consider only the possibilities involving large
discrepancies of $N(J)$. Thus, set

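$$B^{+} = \Bigl\{x : \sup_{J:\ x\in J;\ |J|\ge \kappa/n} N(J)/|J| \ge (1+\varepsilon)n\Bigr\},$$

$$B^{-} = \Bigl\{x : \sup_{J:\ x\in J;\ |J|\ge \kappa/n} N(J)/|J| \le (1-\varepsilon)n\Bigr\}.$$

We will show that, for any $\varepsilon > 0$, there are positive constants $\kappa, \gamma, C$ such
that

$$P_f\{|B^{\pm}| \ge \varepsilon\} \le Ce^{-\gamma n}. \tag{B.1}$$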

The proof of (B.1) is of a familiar type in the theory of empirical processes:
see, for instance, [20], Chapter 3 for related arguments. The strategy is to
bracket each of the bad sets $B^{\pm}$ by nearby sets in finite $\sigma$-algebras whose
cardinalities are small compared to $e^{n}$. To carry out this bracketing, we will
call on a weak form of the Vitali covering lemma (see [24], Chapter 1, Lemma
1.6):

Covering lemma. Let $F$ be a measurable subset of $[0,1]$ that is covered
by a collection $\mathcal{V}$ of subintervals of $[0,1]$. Then there exist pairwise disjoint
intervals $I_1, I_2, \dots$ in $\mathcal{V}$ such that

$$\sum_{j} |I_j| \ge |F|/5. \tag{B.2}$$
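For a finite collection of intervals, a disjoint subfamily of the kind the covering lemma provides can be produced by a greedy selection. The sketch below is an illustration under that finiteness assumption, not the argument of [24]; the helper names and the sample intervals are hypothetical. Each discarded interval meets a chosen interval that is at least as long, so the chosen intervals, tripled about their centers, cover the union, which gives a lower bound of one third of the union's length (and hence also one fifth).

```python
def union_length(intervals):
    """Total length of the union of closed intervals given as (lo, hi) pairs."""
    total, cur_lo, cur_hi = 0.0, None, None
    for lo, hi in sorted(intervals):
        if cur_hi is None or lo > cur_hi:
            if cur_hi is not None:
                total += cur_hi - cur_lo
            cur_lo, cur_hi = lo, hi
        else:
            cur_hi = max(cur_hi, hi)
    if cur_hi is not None:
        total += cur_hi - cur_lo
    return total


def greedy_disjoint_subfamily(intervals):
    """Scan intervals from longest to shortest, keeping those disjoint from earlier picks."""
    chosen = []
    by_length = sorted(intervals, key=lambda iv: iv[1] - iv[0], reverse=True)
    for lo, hi in by_length:
        if all(hi < c_lo or lo > c_hi for c_lo, c_hi in chosen):
            chosen.append((lo, hi))
    return chosen


cover = [(0.0, 0.3), (0.2, 0.5), (0.45, 0.6), (0.7, 0.9), (0.85, 1.0)]
picked = greedy_disjoint_subfamily(cover)
picked_length = sum(hi - lo for lo, hi in picked)
covered_length = union_length(cover)
```

In this example `picked_length` is at least `covered_length / 3`, comfortably within the constant 1/5 used in (B.2).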

For definiteness, consider the set $B^{+}$. This set is, by definition, the union
of intervals $J$ of lengths $\ge \kappa/n$ (not necessarily pairwise disjoint!), each
satisfying

$$N(J) \ge (1+\varepsilon)n|J|. \tag{B.3}$$

By the Covering lemma, there is a collection $\mathcal{J}$ of pairwise disjoint intervals
$J_i$ among these whose lengths sum to at least $|B^{+}|/5$. Because the lengths
of these intervals are bounded below by $\kappa/n$, there are at most $n/\kappa$ intervals
in $\mathcal{J}$.

Let $m = m_n$ be the smallest integer such that $m \ge 8n/(\varepsilon\kappa)$. For each
interval $J_i \in \mathcal{J}$, let $J'_i$ be the minimal interval containing $J_i$ with endpoints
of the form $j/m$, where $j \in \mathbb{Z}$. Observe that the intervals $J'_i$ need not be
pairwise disjoint, but keep in mind that the intervals $J_i$ are. Moreover, $|J_i| \le
|J'_i| \le |J_i| + 2/m$, so by (B.3),

$$N(J_i) \ge (1+\varepsilon)n(|J'_i| - 2/m) \ge (1+\varepsilon')n|J'_i|, \tag{B.4}$$

where $\varepsilon' = \varepsilon/2$. Define $B^{*}$ to be the union of the intervals $J'_i$. By construction,
$B^{*}$ is a union of intervals $[j/m, (j+1)/m]$ and contains $\bigcup_i J_i$; thus, by (B.4),
since the intervals $J_i \in \mathcal{J}$ are pairwise disjoint,

$$N(B^{*}) \ge \sum_{i} N(J_i) \ge (1+\varepsilon')n|B^{*}|. \tag{B.5}$$

Now recall that the collection $\mathcal{J}$ was chosen so that $\sum_i |J_i| \ge |B^{+}|/5$. Since
$B^{*}$ contains $\bigcup_i J_i$, it follows that, on the event $\{|B^{+}| \ge \varepsilon\}$,

$$|B^{*}| \ge \varepsilon/5. \tag{B.6}$$

Hence, to bound the probability (B.1), it suffices to bound the probability
that inequality (B.6) obtains. Observe that there are precisely $2^{m}$ possibilities
for the set $B^{*}$. Let $B$ be such a possibility, and suppose that $|B| \ge \varepsilon/5$.
Under $P_f$, the count $N(B)$ has the Binomial distribution with parameters
$n$, $|B|$. By a standard concentration inequality for the Binomial distribution
(e.g., Hoeffding's inequality), there exist constants $C > 0$ and $\rho = \rho(\varepsilon,\varepsilon') > 0$
such that

$$P_f\{N(B) \ge (1+\varepsilon')n|B|\} \le Ce^{-\rho n}.$$

Therefore,

$$P_f\{|B^{+}| \ge \varepsilon\} \le C\,2^{m}e^{-\rho n}.$$

Finally, recall that $m = \lceil 8n/(\varepsilon\kappa)\rceil$. Thus, if $\varepsilon\kappa$ is sufficiently large, then

$$2^{m}e^{-\rho n} \le e^{-\gamma n}$$

for some $\gamma > 0$, and so (B.1) follows for $B^{+}$. A similar argument (with bracketing
from the inside rather than from the outside) applies for $B^{-}$.
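The final comparison between $2^{m}$ and $e^{-\rho n}$ can be made concrete with a small computation. The sketch below is an assumption-laden illustration: the text does not specify $\rho$, so the code uses the value $\rho = 2(\varepsilon'\varepsilon/5)^2$ that Hoeffding's inequality yields when $|B| \ge \varepsilon/5$, and the function name `bracketing_exponent` is hypothetical.

```python
import math


def bracketing_exponent(eps, kappa, n):
    """Return (1/n) * log(2^m * exp(-rho * n)) with m = ceil(8n / (eps * kappa))."""
    eps_prime = eps / 2
    m = math.ceil(8 * n / (eps * kappa))
    # Hoeffding: P{N(B) >= n|B| + n*t} <= exp(-2*n*t^2), with t = eps'|B| >= eps'*eps/5.
    rho = 2 * (eps_prime * eps / 5) ** 2
    return (m * math.log(2) - rho * n) / n


# Large kappa: the exponent is negative, so the union bound decays exponentially.
large_kappa = bracketing_exponent(eps=0.5, kappa=1e5, n=10**6)
# Small kappa: the 2^m factor dominates and the bound is useless.
small_kappa = bracketing_exponent(eps=0.5, kappa=10.0, n=10**6)
```

With this choice of $\rho$, the exponent is negative roughly when $\kappa > 400\ln 2/\varepsilon^{5}$, which illustrates why $\kappa$ must be taken large relative to $\varepsilon$.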